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Abstract 

We analyze the statistical consistency of robust estimators for precision matrices in high 
dimensions. We focus on a contamination mechanism acting cellwise on the data matrix. The 
estimators we analyze are formed by plugging appropriately chosen robust covariance matrix 
estimators into the graphical Lasso and CLIME. Such estimators were recently proposed in the 
robust statistics literature, but only analyzed mathematically from the point of view of the 
breakdown point. This paper provides complementary high-dimensional error bounds for the 
precision matrix estimators that reveal the interplay between the dimensionality of the problem 
and the degree of contamination permitted in the observed distribution. We also show that al¬ 
though the graphical Lasso and CLIME estimators perform equally well from the point of view 
of statistical consistency, the breakdown property of the graphical Lasso is superior to that of 
CLIME. We discuss implications of our work for problems involving graphical model estimation 
when the uncontaminated data follow a multivariate normal distribution, and the goal is to 
estimate the support of the population-level precision matrix. Our error bounds do not make 
any assumptions about the the contaminating distribution and allow for a nonvanishing fraction 
of cellwise contamination. 

Keywords: Robust covariance estimation, cellwise contamination, Kendall’s tau, Spearman’s 
rho, median absolute deviation. 


1 Introduction 

Covariance matrix estimation has long taken center stage in multivariate analysis (Anderson, 2003). 
The sample covariance estimator, which originates as the maximum likelihood estimator under 
a multivariate normal model, is optimal in many respects: It is unbiased, consistent, efficient 
under various distributional assumptions, and easy to compute. Despite its many positive traits, 
however, the sample covariance matrix is also highly non-robust when data are observed subject 
to contamination. Hence, various procedures in robust statistics have been derived to obtain a 
covariance matrix estimator that behaves well even in the presence of contaminated data (Huber, 
1981; Hampel et ah, 2011). 

In other areas of multivariate analysis, the precision matrix Q* := (51*)“^ is of significant 
interest. Examples include computing Mahalanobis distances, linear discriminant analysis, and 
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Gaussian graphical models. In the setting of graphical models, a random vector X is associated 
with an undirected graph G = {V, E) that encodes the conditional independence relations between 
components of X (Lauritzen, 1996). The vertex set V contains Xi,... ,Xp, while the edge set E 
consists of pairs where (i,j) £ E if Xi and Xj are connected by an edge. For each non¬ 

edge {i,j) 0 -E', the variables Xi and Xj are conditionally independent given all other variables 
X\{ij} := y\{Xj,Xj}. When X ~ X*), pairwise conditional independence holds if and only 

if fl*j = 0. Thus, recovering the support of the precision matrix is equivalent to graphical model 
selection. The aforementioned observations have been used for network reconstruction in many 
scientific fields, including genetics and neuroscience (e.g., see Werhli et al. (2006); Smith et al. 
(2011), and the references cited therein). When the dimensionality p is small compared to the 
number of samples n, a reasonable method for robust precision matrix estimation could consist of 
computing a robust estimate of the covariance matrix and then taking a matrix inverse. 

With the recent deluge of high-dimensional data, however, a need has arisen to obtain high¬ 
dimensional analogs of classical statistical procedures that are both computable and possess rigorous 
theoretical guarantees. Although several methods, notably the graphical Lasso (GLasso) (Yuan and 
Lin, 2007; Friedman et ah, 2008) and the method of constrained hi-minimization for inverse matrix 
estimation (GLIME) (Cai et ah, 2011), have been proposed for high-dimensional precision matrix 
estimation, robust estimation of high-dimensional precision matrices has only recently emerged 
in the literature. The GLasso and CLIME estimators themselves tend to perform poorly under 
contaminated data, since they take as input the sample covariance matrix that is sensitive to even 
a single outlier. 

Popular classical robust covariance estimators are applicable in settings where less than half of 
the observation vectors are contaminated. Such assumption is closely connected to the Tukey-Huber 
contamination model that underlies much of the existing robustness theory (Tukey, 1962; Huber, 
1964). In the Tukey-Huber contamination model, a mixture distribution with a dominant nominal 
component (such as a multivariate normal distribution) and a minority unspecified component are 
posited, and each observation vector is either completely clean or completely spoiled. Classical 
robust covariance estimators then involve downweighting contaminated observations in order to 
reduce their influence. When the dimension p is large, however, the fraction of perfectly observed 
data vectors may be rather small: If all components of an observation vector had an independent 
chance of being contaminated, most observation vectors would be contaminated. Thus, down¬ 
weighting an entire observation would waste the information contained in the clean components 
of the observation vector. This describes the setting of the cellwise contamination model, which 
was developed by Alqallaf et al. (2002). It generalizes the classical Tukey-Huber contamination 
model, which may be viewed as a case of rowwise contamination of the data matrix, and is fairly 
realistic for applications involving measurement error in DNA microarray analysis (Troyanskaya 
et ah, 2001) or dropout measurements in sensor arrays (Swanson, 2000). 

On the other hand, most existing approaches for robust covariance estimation focus on affine 
equivariance. These include the M-estimators (Maronna, 1976), Minimum Volume Ellipsoid (MVE) 
and Minimum Govariance Determinant (MOD) estimators (Rousseeuw, 1984, 1985), and the Stahel- 
Donoho (SD) estimator (Stahel, 1981; Donoho, 1982). Although affine equivariance may be a 
desirable property under rowwise contamination, it is less appropriate in the setting of cellwise 
contamination, since linear combinations of observation vectors lead to a propagation of outliers 
(Alciallaf et ah, 2009). In addition, the MVE, MCD, and SD estimators all require heavy com¬ 
putational effort, rendering them impractical for high-dimensional datasets. To deal with cellwise 
contamination. Van Aelst (2014) proposed a modified SD estimator that adapts winsorization 
(Huber, 1981; Alqallaf et ah, 2002) and a cellwise weighting scheme. Similar to the original SD 
estimator, however, computation is only feasible for small p. A recent approach by Agostinelli 
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et al. (2014) is capable of dealing with both rowwise and cellwise outliers. The procedure consists 
of two steps: (1) flag cellwise outliers as missing values; and (2) apply a rowwise robust method to 
the incomplete data. However, computation is again infeasible in high dimensions. Other recent 
proposals for robust high-dimensional covariance matrix estimation include Chen et al. (2015) and 
Han et al. (2015), but both methods treat different contamination models and are not suitable to 
handle data with cellwise contamination. 

In contrast, relatively few approaches exist for robust high-dimensional precision matrix esti¬ 
mation under any form of contamination. One method is supplied by the TLasso estimator of 
Finegold and Drton (2011), which builds upon the GLasso and models the data as coming from the 
multivariate t-distribution, a long-tailed surrogate for the multivariate normal distribution. The 
“alternative multivariate t-distribution” is used to model a case where different coordinates of the 
distribution are obtained from the latent multivariate normal distribution using different weights. 
Although the TLasso demonstrates a higher degree of robustness than the GLasso under both 
rowwise and cellwise contamination in simulations, however, a theoretical analysis from the point 
of view of robust statistics has not been derived. More recently, Oellerer and Croux (2014) and 
Tarr et al. (2015) propose a promising new method for high-dimensional precision matrix estima¬ 
tion, designed specifically for cellwise contamination. The method consists of combining a robust 
covariance estimator that may be computed efficiently with a suitable high-dimensional precision 
matrix estimation procedure. Whereas Tarr et al. (2015) focus on developing new methodology 
and Oellerer and Groux (2014) analyze breakdown behavior of the precision matrix estimators, 
however, a rigorous high-dimensional analysis from the point of view of statistical consistency has 
not been conducted. 

In this paper, we focus on high-dimensional robust estimation of precision matrices under the 
cellwise contamination model, using the estimators proposed by Oellerer and Groux (2014) and 
Tarr et al. (2015). Formally, we derive statistical error bounds in elementwise 1 ' 00 -norm for ro¬ 
bust precision matrix estimation procedures under an e-contamination model, where at most an 
e fraction of entries in the data matrix are corrupted by outliers. Our work fuses two threads 
of research involving classical robust procedures and high-dimensional statistical estimation in a 
novel and rigorous manner. The bounds we derive match standard high-dimensional bounds for 
uncontaminated precision matrix estimation, up to a constant multiple of e. Furthermore, they 
are of a complementary nature to Oellerer and Groux (2014), since we are primarily concerned 
with robustness as measured from the viewpoint of statistical consistency, rather than breakdown 
behavior. 

More generally, our results reveal an interesting interplay between bounds for statistical error 
under e-contamination and classical measures of robustness such as the influence function (Hampel, 
1974) and breakdown point (Donoho and Huber, 1983). Estimators with bounded influence have 
long been favored in classical robust statistics, as the rate of change in the statistical functional 
associated with the estimator is controlled when the nominal distribution is contaminated by an 
arbitrary point mass distribution. Our results show that a variety of bounded influence estimators, 
including Kendall’s and Spearman’s correlation coefficients, give rise to (inverse) covariance estima¬ 
tors with statistical error rates that depend linearly on the degree of contamination; the converse 
relationship may be seen to hold more generally as a result of our proof arguments. On the other 
hand, our discussion of the breakdown point of the precision matrix estimators, building upon the 
analysis of Oellerer and Groux (2014), emphasizes the significant differences between the notions of 
breakdown point and statistical consistency. Whereas our analysis shows that the robust GLIME 
and GLasso procedures have comparable behavior from the point of view of high-dimensional sta¬ 
tistical consistency, the GLIME estimator has a substantially smaller breakdown point than the 
GLasso, due to its constrained feasibility region. Rather than advocating one measure of robustness 
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over another, our discussion emphasizes the value of considering different quantitative measures of 
robustness in selecting an appropriate estimator. 

The remainder of the paper is organized as follows: Section 2 furnishes the mathematical 
background for the cellwise contamination model and the robust covariance and precision matrix 
estimators to be considered in the paper. Section 3 presents our main theoretical contributions, 
providing bounds on the statistical error of the covariance and precision matrix estimators under 
the cellwise contamination model, as well as concrete consequences in the presence of outliers 
and/or missing data. Section 4 provides a discussion of the breakdown point for the robust GLasso 
and CLIME estimators. In Section 5, we discuss the main steps of the proofs of our theorems. 
Section 6 contains simulation results that are used to validate the theoretical results of the paper. 
We conclude with a discussion in Section 7, including some avenues for future research. 


Notation: For a vector a = (ai,...,Op)^ 
the ^i-norm and .^ 2 -iiorm of f 
,hi 

jj=l 

sup|| 3 ;||<i ||Ax|| 2 , and the matrix t'l-norm 


G M^, we denote by ||a||i = l“*l ll^lb = 

respectively. For a matrix A = {aij) G we 

define the elementwise £oo-norm ||A||oo = maxi<j<p i<j<q |aij|, the elementwise £i-norm ||A||i = 
the Frobenius norm HAHi? = (Ef=i Ej=ispectral norm ||A ||2 = 

A||ij = maxi<j<g kb'l- We use Ai(A) > A 2 (A) > 
> Ap(A) to denote the ordered eigenvalues of A, and we write A 0 (respectively, A ^ 0) to 
indicate that A is positive definite (respectively, positive semidefinite). We write I for the identity 
matrix and 0 for the vector of all zeros (the respective dimension of which will be clear from 
context). The binary operation (g) denotes the tensor product. 


2 Background and Problem Setup 

We begin with a description of the cellwise contamination model, followed by a rigorous formulation 
of the robust covariance and precision matrix estimators to be studied in our paper. 

Following the notation of Alqallaf et al. (2002, 2009), we write the cellwise contamination model 
in the following form: 

Xfc = (I - Bfc)Yfe + BfcZfc, VA: = l,...,n. (1) 

Here, we observe the contaminated random vector G The unobservable random vectors 
Yfc, Zfc, and B^ are independent, and Y^ ~ G (a nominal distribution) and Z^ ~ H* (an unspecified 
outlier generating distribution). Furthermore, B^ = diag(Bfci,..., B^p) is a diagonal matrix, where 
Bki ,..., Bkp are independent Bernoulli random variables with P{Bki = 1) = e*, for all 1 < z < p. 

When ei = • • • = Cp = e, the probability of an observation vector having no contamination 
in any component is (1 — e)^, a quantity that decreases exponentially as the dimension increases. 
This probability goes below the critical value 1/2 for p > 14 at e = 0.05, and for p > 69 at 
e = 0.01. Equation (1) is a special case of a more general model, where we allow other joint 
distributions for Bki,...,Bkp. For instance, if Bki,...,Bkp were completely dependent (i.e., 
P{Bki = ■ ■ ■ = Bkp) = 1), we would obtain the rowwise contamination model. In that case, the 
probability of an observation vector being totally free of contamination would be 1 — e, which is 
independent of the dimension. Alqallaf et al. (2009) also uses the terms fully independent contam¬ 
ination model (FICM) and fully dependent contamination model (FDCM) to denote the cellwise 
and rowwise contamination settings, in order to distinguish the pattern of contamination across 
rows of the data matrix. 

Throughout, we will work under the cellwise contamination model (1), and assume that G 
is a multivariate normal distribution A(/x, 51*). Our goal is to estimate the matrices Xl* and 
ri* = (S*)“^ from the (uncontaminated) normal component. 
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2.1 Covariance Matrix Estimation 


Note that when e = 0 (i.e., the data are uncontaminated), we may use the classical sample covariance 
matrix estimator 5], defined pairwise as 

1 ” 

^ - Xi){Xkj - Xj), VI <i,j < p, 

^ k=l 

where Xi = {l/n)J2k=i^ki and Xj = (1/?^) X]fc=i When n ^ p, the sample covariance is 
an efficient estimator for 51*. However, when e > 0, the performance of S may be compromised 
depending on the properties of H*: Under the cellwise contamination model, for i ^ j, we have 

= (1 - ei){l - ej) (S^),. + 6.e,- (S*^),^- 

= ~ ^i^j) i'^*z)ij ■ 

When no restrictions are placed on the covariance of the contaminating distribution, the ele¬ 
mentwise deviations between 513^ and Sy (and consequently, also the sample covariance 51x := 51 
and 51y) will in general behave arbitrary badly. Furthermore, note that even when 51^ is con¬ 
strained to lie in a space where the deviations between 513^ and 51y are suitably bounded, we 
would require the contaminating distribution to have properties such as sub-Gaussian tails in order 

to ensure consistency of the sample covariance estimator on the order of O . When a pro¬ 

cedure based on covariance estimation is used to estimate the precision matrix, the errors incurred 
during the covariance estimation step would propagate to the next step. For instance, this issue 
would arise in using the CLIME or GLasso estimator. In contrast, our theory for robust covariance 
estimators will not require any assumptions on either 51^ or the tail behavior of the contaminating 
distribution. 

To deal with cellwise contamination in the high-dimensional setting, we therefore take the 
pairwise approach suggested by Oellerer and Croux (2014), where a robust covariance or correlation 
estimate is computed for each pair of variables. Early proposals of robust procedures are of this 
type (Bickel, 1964; Puri and Sen, 1971), where a coordinatewise approach is taken for robust 
estimation of location. In addition to having relatively low computational complexity, the pairwise 
approach is appealing because a high breakdown point of the pairwise estimators translates into a 
high breakdown point of the overall covariance matrix. Eor 1 < i, j < p, we write 

~ (2) 

where Uj = [Var(Xfcj)]^/^, aj = [Var(Xfcj)]^/^, and = CoTr{Xki, X^j). We will take suitable 
robust estimators of dj, dj, and to obtain the covariance matrix estimator 5], with (i, j) entry 

To estimate ai, we consider the median absolute deviation from the median (MAD), a robust 
measure of scale. The MAD estimator was popularized by Hampel (1974), who attributes the 
concept to Gauss. It has a breakdown point of 50%. Let j < • • • < j denote the ordered 
values of Xu,... ,Xni- The sample median m* and the sample MAD di are defined, respectively, 
as 

771 j j, and di 

where Wki = \Xki — rhil, for all A: = 1,..., n, and k* = [ 77 / 2 ]. Expressed another way, 


di 


median 

l<fc<n 



median(X£4 

l<£<n 


(3) 


5 






We then estimate ai by a* = [<h“^(0.75)]“^cii, where the constant [<h“^(0.75)]“^ is chosen in order 
to make the estimator consistent for ai at normal distribntion. The population-level median of a 
distribution with cdf F is defined to be m{F) := F~^ (0.5), where F~^{c) = inf{x : F{x) > c}, 
for c G [0,1]. Similarly, we may define the population-level MAD d{F) to be the median of the 
distribution of \X — m{F)\, where X has cdf F. 

To estimate p^j, we consider the classical nonparametric correlation estimators, Kendall’s tan 
and Spearman’s rho: 


Kendall’s tan: This statistic is given by 

2 

“ Xei)sign{Xkj - X^j), 

' k<e 

where sign(A) = 1 if A > 0, sign(A) = —1 if A < 0, and sign(O) = 0. 


(4) 


Spearman’s rho: This statistic is given by 

^ ELi[^ank(Afci) -{n + l)/2][rank(Afcj) - {n + l)/2] 

\/ELi[rank(Afci) - {n + l)/2f ELi[i’ank(Afcj) -{n + l)/2f ’ 
where rank(Afcj) denotes the rank of A^j among Xu ,..., A^j. 

The population versions of the estimators are given, respectively, by 

pfj = K[sign(Aij - A 2 j)sign(Aij - X 2 j)], (6a) 

pfj = 3K[sign(Aii - A 2 i)sign(Aij - Xsj)]. (6b) 

When ei = • • • = Cp = 0, we have ~ A(/x, S*); in this case, it is known that (Kendall, 1948; 
Kruskal, 1958) 

Pij=sm(^pfj') =2sin(^pg) . 

Hence, for asymptotic consistency at normal distribution, our estimator for p^j is the transformed 
version of Kendall’s tan and Spearman’s rho, given by sin(|r^) and 2sin(|r^), respectively. We 
then define as S our robust covariance matrix estimator, with 

'^ij = sin j , and = 2&idj sin . (7) 

2.2 Precision Matrix Estimation 

A long line of literature exists for precision matrix estimation in the high-dimensional setting. We 
will focus our attention on sparse precision matrix estimation; i.e., ft* contains many zero entries. 
In this section, we review two techniques, the GLasso and CLIME, which produce a sparse precision 
matrix estimator based on optimizing a function of the sample covariance matrix. As proposed 
by Oellerer and Croux (2014) and Tarr et al. (2015), these methods may easily be modified to 
obtained robust versions, where the sample covariance matrix estimator is simply replaced by a 
robust covariance estimator X as described in the previous section. 

The graphical lasso (GLasso) estimator (Yuan and Lin, 2007; Pidedman et ah, 2008) is defined 
as the maximizer of the following f'l-penalized log-likelihood function: 

Q = argmin|tr(Sr2) — logdet(r2) -|- A||S7||i}. 
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Here, A > 0 is a tuning parameter that controls the sparsity of the resulting precision matrix 
estimator. 

In this paper, we replace the sample covariance matrix 'S by the robust alternative 5], and 
consider a variant where only the off-diagonal entries of the estimator are penalized: 

fl = argmin |tr(Sri) — logdet(r2) -|- A||ri||i^ofr}- (8) 

Note that although the program (8) is convex for any choice of 5] G several state-of-the-art 

algorithms for optimizing the GLasso require the matrix S to be positive semidefinite (Friedman 
et ah, 2008; Zhao et ah, 2012; Hsieh et ah, 2011). We will first derive statistical theory for the 
robust GLasso without a positive semidefinite projection step, and then discuss properties of the 
projected version in Section 4. 

A popular alternative to the GLasso is the method of constrained .^i-minimization for inverse 
matrix estimation (CLIME) proposed in Cai et ah (2011). The CLIME routine solves the following 
convex optimization problem by linear programming: 

Li = argmin ||fi||i subject to ||E)r2 — I||oo < A. 

neRpxp 

Note that here, no symmetry condition is imposed on fl, and the solution is not symmetric in 
general. If a symmetric precision matrix estimate is desired, we may perform a post-symmetrization 
step on to obtain the symmetric matrix fisym, defined by 

^sym = (wij), where 

Uij = Uji = < |w]i|) + > |w)i|). (9) 

In other words, between ujjj and we pick the entry with smaller magnitude. Similar to the 
GLasso case, we will robustify the GLIME estimator by solving 

= argmin ||ri||i subject to ||5]r2 — I||oo < A, (10) 

and then apply the post-symmetrization step (9) to fl to obtain the final robust CLIME estimator 

^syrti' 

3 Main Results and Consequences 

In this section, we provide rigorous statements of the main results of the paper. We begin by 
deriving bounds for robust covariance matrix estimation, which are used to obtain bounds on the 
error incurred by the precision matrix estimator. Note, however, that the statistical error bounds 
presented in Section 3.1 are of independent interest, and we believe they are the first bounds 
appearing in the literature that quantify the robustness of covariance matrix estimators under a 
cellwise contamination model. 

3.1 Covariance Matrix Estimation 

Throughout this section, we will assume that the standard deviations of the uncontaminated dis¬ 
tributions are bounded as follows: 


0 < min Uj < max ai < M^. 

l<i<p 


( 11 ) 
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We also define the expression 


c(cTi) 


15 / (l.lT. + O.Sjn 

64VSf<r. 2al i' 


VI < f < p. 


Our first theorem provides a bound on the statistical error of the robust covariance estimator 
S based on Kendall’s tau correlations. Note that our result does not involve any assumptions 
on the nature of the contaminating distribution H. Thus, the distribution H may contain point 
masses, and we do not require a probability density function of H to even exist. 


Theorem 1. Under the cellwise eontamination model (1), suppose inequality (11) is satisfied, and 


e = maxi<j<p e* < 0.02. Let C > T^y/2 and C > — 


$-1(0.75) mini<i<p c{(Ti)^’ 


and suppose 


max 



+ 20776 , 



+ 7.2M„ 


< 1 , 


( 12 ) 


and < 1. 


Then with probability at least 


I — 2p h0.75)]2C'2mini<i<pc2(o-i)-l}^ 


the robust covariance estimator satisfies 


— S’* 


< {C{M^ + M^ + 1) + C'{2M„ + 1)) + (97M2 + 89M, + 82) e. (13) 


The proof of Theorem 1 is provided in Section 5.1. 


Remark 1. Theorem 1 clearly illustrates the efifeet of e-contamination on the estimation error 

of the covarianee matrix estimator. Note that when e = 0, we recover the minimax optimal rate 

^ K 

for covariance matrix estimation in i^o-norm (Cai and Zhou, 2012); although the estimator X) 
is not equal to the sample covariance estimator in the uncontaminated case, the robust covariance 
estimator nonetheless converges to the true covariance matrix at the optimal rate. On the other 
hand, cellwise contamination introduces an extra term that is linear in e. 

Another way to interpret the bound (13) is that if the level of eontamination e is bounded by 

a constant times then the robust covariance estimator Cl will enjoy the same statistical 

error rate as the optimal covariance estimator in the uneontaminated ease. As we will see in 
Theorems 3 and j below, the sample size requirements for precision matrix estimation are such that 

the condition e < still allows for a nonvanishing fraction of contamination. Furthermore, 

note that although the restriction e < 0.02 may seem somewhat prohibitive, the proof of Theorem 1 
reveals that the specifie bound on e is an artifaet of the proof technique, and a more eareful analysis 
would allow for a larger degree of eontamination, at the expense of slightly looser constants in the 
covariance estimation bound (13), as long as e is bounded by some constant in [0,1). 

The following theorem is an analog of Theorem 1, derived for the robust covariance estimator 
- S 

X) based on Spearman’s correlation coefficient. We assume that the ranks of variables between 
samples are distinct; note that this happens almost surely when the contaminating distribution has 
continuous density. 


















Theorem 2. Under the cellwise contamination model (1), suppose the variable ranks are dis¬ 
tinct. Also suppose inequality (11) is satisfied and e = maxi<j<pei < 0.01. Let C > 8 tt and 


C > ^ ^and suppose 

#-l(0.75)mini<i<pc(cri)V2 


max 


5C 



+ 5l7re, C 



+ 7.2M„€ 


< 1, 


and the sample size satisfies ^~^{0.75)C'< 1 and n > max|l5, Then with proba¬ 

bility at least 

I _ — 6p-{2[‘J’"n0-75)]2C'2mini<i<pc2(cri)-l}^ 

the robust covariance estimator satisfies 


E - E" 


< 


oo 


(^{M^ + M^ + l) + C'{2M, + 1)^ + (175M2 


4- ms A/f 




The proof of Theorem 2 is provided in Section 5.2. 

Remark 2. The conclusion of Theorem 2 is very similar to that of Theorem 1, except for constants 
and an additional requirement on the size of n. However, note that when = o{l), implying the 

statistical consistency of the robust covariance estimator, the requirement n > max 115, is 

essentially extraneous. 

Althongh the high-dimensional error bounds derived in Theorems 1 and 2 are substantially 
different from the canonical measures analyzed in the robust statistics literature, our bounds are 
somewhat related to the notion of the influence function of an estimator. The influence function 
(Hampel, 1974), defined at the population level, measures the infinitesimal change incurred by the 
statistical functional associated with an estimator when the underlying distribution is contaminated 
by a point mass. Thus, an estimator has a bounded influence function if the extent of the deviation 
in its functional representation due to contamination remains bounded, regardless of the location 
of the point mass. The error bounds (13) and (14) also reveal that the extent to which the error 
deviation between the robust covariance estimator and the true covariance grows is bounded by 
a constant depending only on M^. The two notions do not match precisely; for instance, our 
theorems allow contamination by an arbitrary distribution rather than simply a point mass, and 
we are comparing finite-sample deviations of an estimator from S* rather than population-level 
deviations of a statistical functional under a contaminated distribution. However, note that by 
sending n —>■ oo in the finite-sample bounds and taking the contaminating distribution to be a point 
mass, we may conclude that the influence function of the robust covariance estimator is bounded. 
Furthermore, the arguments in our proofs (cf. Lemmas 12 and 13 in Appendix C) may be used to 
derive the fact that the corresponding correlation estimators have a bounded influence function, 
the precise forms of which appear in Croux and Dehon (2010). The reverse implication, that a 
correlation estimator with bounded influence (together with a bounded-influence scale estimator) 
gives rise to high-dimensional deviation bounds of the form in inequalities (13) and (14), seems 
natural but is not immediate. 

Finally, note that although Theorems 1 and 2 have been derived under the assumption that 
the uncontaminated data are drawn from a normal distribution, the same proof techniques may be 
applied to analyze settings where the uncontaminated data are drawn from a different underlying 
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distribution, as long as the uncontaminated distribution is suitably well-behaved (e.g., has sub- 
Gaussian tails). Since our ultimate goal is precision matrix estimation, we have focused only on 
the scenario where the uncontaminated data are drawn from a Gaussian distribution, in which case 
the structure of the precision matrix is of great interest in the statistical community. 


Extensions. Similar high-dimensional error bounds could be derived for the robust covariance 
estimator based on the quadrant correlation estimator, which is given by 


Q ^ 
ij 


1 

n 


n 

^sign 

k=l 


Xki - median 

l<l<n ) 


sign 



median 

\<l<n 


and also known to have bounded influence (Shevlyakov and Vilchevski, 2002). However, we do not 
provide the full derivations here, since they follow from similar arguments to the ones used in the 
case of Kendall’s and Spearman’s correlations. 

We also comment briefly on another pairwise covariance estimator appearing in the robust 
statistics literature. Tarr et al. (2015) and Oellerer and Croux (2014) propose to use the following 
estimator based on an idea of Gnanadesikan and Kettenring (1972): Noting that 

Cov(X, Y) = ^ [Var(aX + f5Y) - Var(aX - ^Y )], 


where a = 1 /Y^Var(X) and (3 = 1/ Y^Var(y), the proposal is to replace the variance estimator by a 
robust variance estimator (e.g., the square of the MAD estimator). However, the drawback of this 
estimator in comparison to the covariance estimators based on Kendall’s tau and Spearman’s rho is 
that the covariance estimator has a maximal breakdown point of 25% under cellwise contamination, 
since the argument in the variance involves a sum of variables, and any robust variance estimator 
has a maximal breakdown point of 50%. We remark that from the point of view of statistical 
consistency, a version of the Gnanadesikan-Kettenring covariance estimator may be analyzed in the 
same manner as the above estimators. Indeed, if we consider the covariance estimator 


1 

4 




(15) 


where is the (rescaled) MAD statistic computed from {X^i + X^j : 1 < A: < n}, and 

is analogously defined to be the MAD statistic computed from {Xki — Xkj ■ 1 < k < n}, our 
derivations showing the consistency of the MAD estimator (cf. Lemmas 10 and 11, with minor 
modifications) show that 




<Gi 



+ ^ 26 , 


and 


max 






<Gi 



+ C2G, 


for data from the cellwise contamination model, where iT(j _|_ and cj(jj) _ are the population-level 
standard deviations of the distributions of Xki + Xkj and Xki — Xkj, respectively. Thus, 


1<2 j'<p 




r/ /lOgP 


+ C"e, 


n 


as well, from which we may conclude that the covariance estimator (15) deviates from the true 
covariance Cov{Xki, Xkj) by the same margin. 
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Finally, we remark briefly about another popular robust scale estimator known as the Qn 
estimator (Rousseeuw and Croux, 1993), defined as follows: 

Qn ^ ^ 

where c is a constant factor, chosen such that Qn is Fisher-consistent for the population standard 
deviation, and k* = \ ( 2 )/4]. Since the Qn estimator is also based on quantiles, essentially the same 
types of arguments used to derive MAD concentration (cf. Appendix B) may be used to establish 
concentration bounds for the Qn estimator similar to those appearing in Lemmas 10 and 11, up to 
constant factors. 


3.2 Precision Matrix Estimation 

Using the novel statistical error bounds derived in the previous section, we now provide statistical 
error bounds on the precision matrix estimators attained by plugging the robust covariance matrix 
estimates into the CLIME and GLasso. We provide explicit statements in the case of the covari¬ 
ance estimate based on Kendall’s tau; analogous statements hold for Spearman’s rho, assuming 
uniqueness of ranks. 

We begin with the CLIME estimator. Consider the following uniformity class of matrices: 


U{q, so{p),M) = <^ ^ 0, ||ri||Li < M, max \ujij\'^ < sq{p) 

I j 


(16) 


for 0 < q < 1, where Cl := (ojij) = (o^i,..., cjp). The following result provides an elementwise error 
bound on the estimation error between the CLIME output and the true precision matrix, provided 
the true precision matrix lies in the class (16) defined above: 

Theorem 3. Under the cellwise eontamination model (1), suppose inequality (11) is satisfied, and 
e = maxi<j<Bej < 0.02. Let C > 11^/2 and C > ^ ^and suppose inequal- 

-^ “ 'I>-i(0.75)mini<i<pC((Ti)V2" 

ity (12) also holds and $“^(0.75)6"-^/^^^ < 1. If the regularization parameter satisfies 


(17) 


A > M (C(M2 + + 1) + C'{2Mn + 1)) + M {97MI + 89M<, + 82) e, 

then with probability at least 

1 _ - gp-{2['I>~T0.75)]2C'2mini<i<pc2(o-i)-l}^ 

the CLIME estimator (10) satisfies 

\\n-n*\\^<A\\n*\\L,x. 

The proof of Theorem 3 is contained in Section 5.3. 

Remark 3. Clearly, the optimal ehoice of X to minimize the estimation error bound in Theorem 3 
is X = Cl + (726, where Ci and C 2 are the eonstant prefaetors appearing on the right-hand 

side of inequality (17). In this ease, the estimation error bound takes the form 


\n-n* 


< 4||n* 


ID I Ci\r^ + C2e] <4M\Ci 


n 


logp 


n 


+ (726 
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Turning to the GLasso, we focus on the class of precision matrices satisfying the following 
incoherence assumption: 

Assumption 1. There exists some 0 < a < 1 such that 

max||r:5(r^5)-i|U, <1-0, (18) 

where T* := S* (g) S* and S = supp(f2*) is the true edge set. 

We then have the following result, which is stated in terms of the population-level quantities 

= and Kr* = ||(n 5 )”'llLi, 

as well as k, the maximum number of nonzero elements in each row of ft*. The theorem also 
involves constants Cq, Ci, and C 2 , which are independent of e and the problem instances n, p, and 
k. 


Theorem 4. Under the cellwise contamination model (1), suppose inequality (11) is satisfied, and 
e = maxi<j<pej < 0.02. Also suppose the sample size satisfies the scaling 


n > C 2 Tlogp ■ 


( _ ^ _ 

V6(l -|- 8/o)fcmax{Ks*Kr*; 



(19) 


and suppose Assumption 1 holds. Suppose A = ^ ^Goe + Ci’sJ . Then with probability at 
least 1 — p^~'^, the GLasso estimator (8) satisfies supp(ri) C supp(r2*), and 


iifi - n*\ 


The proof of Theorem 4 is contained in Section 5.4. Note that Theorem 4 implicitly assumes 
that e < so that the expression in parentheses on the right-hand side of inequality (19) is 
positive. 

Remark 4. Comparing the results of Theorems 3 and 4, we see that as in the traditional uncon¬ 
taminated setting, the GLasso delivers slightly stronger guarantees, at the expense of more stringent 
assumptions. In particular, the GLasso requires the sample size to scale as n > Ck‘^logp, whereas 
the CLIME requires the scaling n > logp in order to achieve consistency. When the 

parameter M defining the precision matrix class scales more slowly than k‘^, the CLIME thus 
requires a weaker scaling. In addition, the GLasso result supposes Assumption 1, which posits 
an incoherence bound on submatrices of T*. On the other hand, Theorem 4 establishes that the 
supp(S4) C supp(f2*) for the GLasso estimator, whereas Theorem 3 only guarantees consistency 
for the CLIME estimator in terms of ioo-norm, so the estimated support might contain extraneous 
terms. In the case of the CLIME estimator, however, the true support of H* may be obtained via 

thresholding, assuming the nonzero elements of ft* are of the order Ll 

Focusing on the level of contamination e in relation to the problem dimensions, note that 
Theorems 3 and 4 both imply an O -|-0(e) error bound on the precision matrix estimator, 

under the corresponding assumptions. Hence, when e < Gy the estimation error matches the 
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error of the optimal precision matrix estimator in the uncontaminated case, up to a constant factor 
(Ren et ah, 2015). Further note that when e < the condition e = O (^) required by the 

condition (19) in Theorem 4 clearly holds when the sample size satisfies n > Ck‘^\ogp. Note that 
although the level of contamination tolerated by the estimator decreases as the level of sparsity 

increases, it is not required to decrease as n and p increase, as long as the ratio remains 

fixed. Thus, the conclusions of Theorems 3 and 4 are truly high-dimensional. As in the case of the 
robust covariance matrix estimators, another nice feature is that when the data are uncontaminated 
(e = 0), the rate of convergence of the robust precision matrix estimator to the true precision matrix 
agrees with the optimal rate. 

Lastly, note that since the inverse of the correlation matrix has the same support as the precision 
matrix, we could also estimate supp(f2*) using the Kendall’s or Spearman’s correlation matrices 
, defined by 



respectively, as inputs to the CLIME (10) or GLasso (8). Then the same derivations as in The¬ 
orems 3 and 4, omitting the concentration bounds on the MAD estimates of scale, would show 
convergence of and p^ to the population correlation matrix p* in .^oo-norm. Note, however, 
that the conditions for support recovery would then need to hold for the correlation matrix p*, 
rather than for the precision matrix fi*. In particular, a minimum signal strength requirement on 
p* is stronger than the same requirement imposed on il*, since the latter can scale inversely with 
the standard deviations of individual variables in the joint distribution. Therefore, we have chosen 
to focus our attention in this paper on the output of the CLIME and GLasso when applied to an 
estimate of the covariance matrix rather than the correlation matrix. 


3.3 Consequences for Robust Estimation 

We now interpret the conclusions of our theorems in some concrete settings, where the data matrix 
is contaminated according to several different mechanisms. 


Constant fraction of outliers: We first briefly discuss the most basic setting of cellwise con¬ 
tamination, to emphasize the generality of our results. Following the model (1), suppose each entry 
of the data matrix X is contaminated independently with probability e. Furthermore, either all 
contaminated entries may be drawn independently from a fixed contaminating distribution, or the 
contaminated entries in each row may be drawn jointly from a fixed contaminating distribution. In 
each case. Theorems 1 and 2 provide elementwise error bounds on the robust covariance estimators, 
and Theorems 3 and 4 provide elementwise error bounds on the robust precision matrix estimators 
constructed from the CLIME and GLasso. The strength of the theorems lies in the fact that we do 
not make any side assumptions about the outlier distribution; in particular, it may be heavy-tailed 
and/or contain point masses. Hence, whereas statistics such as the sample covariance and sample 
correlation will have slower rates of convergence due to a constant fraction of outliers drawn from 
an ill-behaved distribution, their robust counterparts are agnostic to the outlier distribution. 

It is also important to note that the statistical error bounds given in the theorems of Sections 3.1 

and 3.2 continue to hold when e > Cly The difference is that in such scenarios, the statistical 


error will be of the order 0(e) rather than O 


logp 

n 


However, the effect of an e fraction of outliers 


nonetheless grows only linearly as a function of e. This emphasizes the robustness properties of the 
covariance and precision matrix estimators studied in our paper. 
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Missing data: Turning to a somewhat different setting, note that missing data may also be 
seen as an instance of cellwise contamination. In this model, data are missing completely at 
random (MCAR), meaning that the probability of missingness is independent of the location of 
the unobserved entry of the data matrix (Little and Rubin, 1986). In other words, if we observe 
the matrix with missing entries, where the probability that an entry in column i is missing is 
equal to e*, we have 


Yfcj, with probability 1 — e^, 

missing, with probability e*. 


( 20 ) 


where Y is the fully-observed matrix. Note that if we zero-fill the missing entries of X™®, the 
resulting matrix X exactly follows the cellwise contamination model (1), with = 0 for all k. 
The following result is an immediate consequence of our theorems: 


Corollary 1. Suppose data are drawn from the missing data model (20), and the matrix X is the 
zero-filled data matrix. Let e = maxi<j<pei. Under the same conditions as in Theorem 3, we have 


<4||n*|U,A, 


for the robust CLIME estimator constructed from X. Under the same conditions as in Theorem 4, 
we have supp(n) C supp(n*) and 

I|fi - niloo < 2||(r55)-i||i, (^1 + , 

for the robust GLasso estimator constructed from X. 

Note that the conclusion of Corollary 1 does not actually require the matrix X to be zero-filled 
for missing values; in fact, we could fill the missing entries with samples generated according to 
any distribution (as long as the distribution remains the same across rows). This is because the 
missing entries are essentially taken as outliers. Of course, our bounds should only be interpreted 
up to constant factors, and filling missing entries in a strategic way, e.g., filling entries in column i 
with the mean E(Xki), could lead to smaller estimation error in practice. 


Rowwise contamination: Although we have thus far assumed that data are contaminated ac¬ 
cording to a cellwise mechanism, we now show that the same results apply for rowwise contamina¬ 
tion, as well. Recall that each row in the data matrix for the rowwise contamination model with 
contamination level e is given by 

Xk = {1 — Bk)Yk-I BkZk, \fl < k < n, (21) 

where Y^ is the uncontaminated row vector, is the contamination vector, and B^ ~ Bernoulli(e). 

Although model (21) differs from model (1), a simple inspection of the proofs of Theorems 3 
and 4 shows that only Lemma 1 needs to be modified. Furthermore, the equation (44), giving the 
distribution of pairwise entries in a row, simply needs to be replaced by the equation 

= (1 - VI </c < n, (22) 

in the proof of Lemma 1. Equation (22) comes from the fact that the pair is either drawn jointly 
from a normal distribution with probability 1 — e, or from the contaminating distribution with 
probability e. Then the remainder of the argument follows as before, implying that the same 
conclusion of Lemma 1 applies. (We could obtain a smaller prefactor for e in the bound (29), since 
2e is replaced by e, but we are not concerned about optimizing constants here.) 

We therefore arrive at the following result: 
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Corollary 2. Under the rowwise contamination model (21), the same conclusions as in Corollary 1 
hold for the CLIME and GLasso estimators constructed from X. 

We emphasize that the rowwise contamination model (21) is not in general a special case of 
the cellwise contamination model (1); rather, the proof techniques for analyzing the cellwise model 
may be used to handle the rowwise model, as well. 

4 Breakdown Point 

We now turn to a brief discussion of the breakdown point of the estimators studied in this paper. As 
is discussed in Donoho and Huber (1983) and Hampel et al. (2011), breakdown analysis concerns 
the global behavior of a procedure, under large departures from an assumed situation. On the 
other hand, the theoretical analysis of statistical consistency and efficiency are related to notions 
of infinitesimal robustness, and quantifies the local behavior of a procedure at or near the assumed 
situation. The analogy is made in Donoho and Huber (1983) between the fields of material science 
and statistics, where the notions of stiffness (resistance of a material to displacements caused by 
a small load) and breaking strength (the amount of load required to make the material fracture) 
parallel those of the influence function and the breakdown point. Ideally, a procedure should 
perform well both locally and globally; optimizing either measure alone is unwise. Our key result 
of this section shows that although the GLasso and CLIME estimators both enjoy roughly the 
same statistical rate of estimation, the CLIME does not perform as well as the GLasso when the 
breakdown point is used to quantify the degree of robustness. 

Our analysis of the GLasso estimator closely follows that of Oellerer and Croux (2014); however, 
since the specific precision matrix estimators analyzed in our paper differ slightly from those of 
Oellerer and Croux (2014), we include the full argument for the sake of completeness. We define 
the finite-sample breakdown point of the precision matrix estimator under cellwise contamination 
to be 

en(n,X):= min i — : sup D(n(X), ^(X™)) = oo I , (23) 

l<m<n y Tl J 

where 

D{A, B) := max {|Ai(A) - Ai(B)|, \\-\A) - A;HB)|} , 

and X™' is a data matrix obtained from X by replacing at most m entries in each column by 
arbitrary elements. We also define the explosion finite sample breakdown point of a covariance 
matrix estimator as follows: 

e+(S,X):= min { - : sup |Ai(S(X)) - Ai(S(X™))| = ooj (24) 

l<m<n n yjn J 

(Maroima and Zamar, 2002). Note that the explosion breakdown point only accounts for maxi¬ 
mum eigenvalues, whereas the overall covariance matrix estimator breaks down under explosion or 
implosion (i.e., arbitrarily small minimum eigenvalues). Also, the breakdown point under cellwise 
contamination is less than or equal to the breakdown point under rowwise contamination, since the 
supremum in the latter case is only taken over X™ with at most m rows replaced. 

We will consider the breakdown behavior of a slightly tweaked version of the GLasso presented 
earlier, using a positive semidefinite matrix as the input to the optimization problem. Consider the 
matrix 

S(X) := argmin ||S — M||oo, (25) 

M^O 
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where S = 5](X) is the robust covariance matrix estimator constructed from the data matrix X. 
Let 

fi(X) := argmin{tr(Sr2) — logdet(r2) + A||S7||i^ofr} (26) 

be the corresponding GLasso estimator. Note that from a computational standpoint, the projection 
step (25) is important so that fast solvers for the GLasso program (26) may be applied (e.g., 
Friedman et al. (2008)). Furthermore, we note that the projection step (25) constitutes a convex 
program, so the additional computational time is negligible compared to the computation required 
for running the GLasso. We have the following result: 

Theorem 5. Consider the positive semidefinite version of the robust GLasso estimator (26). Then 
under the same conditions as in Theorem 4, we have supp(fi) C supp(ri*) and 

+ (27) 


in - n* 


oo ^ 


2||(r 


ss) 


\-ll 


1 + 



Furthermore, for any data matrix X G the breakdown point satisfies Cn (n,x) = 50%. 

The proof of Theorem 5 is provided in Section 5.5. 

Remark 5. Note that Theorem 5 guarantees that the robust GLasso estimator Cl obtained from a 
semidefinite projection of the robust covariance estimator shares the same level of statistical consis¬ 
tency achieved by the robust GLasso estimator Cl. In addition, the precision matrix estimator Cl has 
a breakdown point of 50%. Although other authors (Oellerer and Croux, 2014; Tarr et al., 2015) 
also suggest projecting the robust covariance estimator onto the positive semidefinite cone before 
applying the GLasso, they advocate a projection in terms of the Frobenius norm rather than the 
i^o-norm in the optimization program (25). As can be seen in the proof of Theorem 5, minimizing 
the elementwise ioo-norm is much more natural from the point of view of statistical consistency, 
since it guarantees that the ^oo-error between the precision matrix estimate and the true precision 
matrix grows by at most a factor of two. 

Turning to the GLIME estimator, we now show that although the CLIME is as robust as the 
GLasso in terms of statistical consistency under the cellwise contamination model, it has much 
poorer breakdown behavior. Consider the CLIME estimator based on corrupted data: 


min ||n||i 

s.t. ||x(x’")n-/||oo < A, 


(28) 


where X(X™') is the robust covariance estimator based on a data matrix with at most m arbitrarily 
corrupted entries per column. Since the CLIME estimator arises as the solution to a constrained 
linear program, the solution is undefined (infinite) when the problem is infeasible. Indeed, we will 
show in the following theorem that such a case may arise even by corrupting at most one entry in 
each column of the data matrix. 

Theorem 6. In the case when p = 2, there exists X G such that Cn = h where Cl 

denotes the GLIME estimator. 


The proof of Theorem 6, supplied in Section 5.6, provides the construction of a data matrix 
X G where the CLIME estimator becomes infeasible after perturbing a single entry in each 

column. This is in stark contrast to the result in Theorem 5, which establishes that the breakdown 
point of the robust GLasso estimator is 50%, for any realization of the data matrix X. 
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Remark 6. Although Theorem 6 is stated for the case p = 2, the argument used to prove the 
theorem is readily generalizable to higher dimensions, as well, in which case we would also have a 
matrix X G satisfying e„ = i. For instance, we could construct an n x p matrix 

such that S(X^) is a block matrix with upper-left block equal to the matrix constructed in the proof 
of Theorem 6, lower-left block equal to the identity, and off-diagonal blocks equal to zero. 

The conclusion of Theorem 6 underscores the fact that consistency and breakdown point under 
cellwise contamination are in some sense orthogonal measures of robustness. As we demonstrated 
in the previous section, both the CLIME and GLasso lead to estimators that enjoy good rates of 
statistical consistency when the contamination fraction e is sufficiently small relative to the problem 
parameters. On the other hand, the results of this section show that the CLIME is extremely non- 
robust in terms of its breakdown point. Similarly, procedures such as the Gnanadesikan-Kettenring 
estimator (15) may be shown to be statistically consistent under cellwise contamination, but as 
discussed in Oellerer and Croux (2014), the breakdown point of the covariance estimator X is at 
most 25%, which leads to error propagation in Fl. 

Finally, we note that the notion of breakdown point that we consider in equation (23) is defined 
with respect to a finite sample, without recourse to probability distributions. Other notions of 
breakdown point, defined with respect to an e-contaminated distribution, have also been studied 
in the literature (Hampel et ah, 2011). For some alternative measures of breakdown robustness, 
the CLIME estimator may have a more controlled breakdown behavior, but we have not explored 
them here. 


5 Proofs 

In this section, we provide an outline of the proofs of the main theorems in the paper. Proofs of 
the more technical supporting lemmas are contained in the supplementary Appendix. 


5.1 Proof of Theorem 1 


The proof is based on Lemma 1, which gives an error bound for the pairwise terms sin(^r^), and 
Lemma 2, which gives an error bound for the scale estimates dj. Note that we require the bound 
e < 0.02 on the level of contamination in Lemma 1, but the requirement could be relaxed with 
a more refined proof technique. The proofs of Lemmas 1 and 2 are provided in Appendices A.l 
and A.2. 


Lemma 1. Under model (1), let e = maxi<j<pej < 0.02. For any constant C > vr-v/^; we have 


max 

1<2 J<p 


< C 


logp 


+ 267re, 


(29) 


with probability at least 1 — 2p 

Lemma 2. Under model (1), suppose 0 < mini<j<pcrj < maxi<j<p(Tj < M^, and the maximum 

, and 


contamination error satisfies e = maxi<j<pej < Let c{ai) = ^xp 


(l.lf7i+0.5)2 

2<t2 


suppose C > ^ 

$~i(0.75) mini<i<p Cl 

at least 1 — bo.75)]%'2mmi<i<pc2(o-i)-i}^ have 


Also suppose <I> ^(0.75)C"-y/^^ < 1. Then with probability 


max cTj — (Tj 

1<2<P 


<c\l^ 

V n 


+ 7.2M„€. 
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Using the fact that 


ABC - abc ={A-a){B- b){C - c) + aC{B -b)+ Ab{C - c) + Bc{A - a), 

we can decompose \aiaj sin(|r^) — Sljj l = sin(|r^) — aiajp^jl by the triangle inequality, as 
follows: 


aiaj j - aiajPij 




sin I \^ij ) - Pij 


+ 


^ K 
Gi Sin ( -r^j 


\(y j ^7 


sin I ) - P 




W 

^ Gj\ 


Sin - Pij 


+ |di(Tjf| 


I j - P^J 


-|- liTj Pij 11<^2 (7i\ 

(7i\dj — (7j\ 
H“ ^j \ ^i ^i\ 


^ \^i ^j\ 


+ {\^i - CTil + ai)aj 


U]-p.. 
IJ J Hij 

+ (7 



(2 

■ Pij 


+ i\^j ~ ^j\+ ^j)\^i - 


where (i) uses the facts that | sin(x)| < 1 for all x, and \pij\ < 1, since it is a correlation coefficient. 
Using Lemmas 1 and 2 and the assumption (12), we obtain the overall upper bound 


C 


logp 


n 


+ 267re C' 


-/ /logp 


+ 7.2M^e + C' 


n 


7 /logp 


+ 7.2M^e 


n 


+ M^ + C 


/ Bogp 


+ 7.2M,e W C 


n 


logp 


n 


+ 267re \ + \ C 


/ /logp 


+ 7.2M^e 


n 


logp 


< (M,(M, + 1) + 1) C\ -AAL + 267re + (2M, + 1) C' 


n 


7 /logp 


+ 7.2M„e 


n 


implying inequality (13). 


5.2 Proof of Theorem 2 

The proof is based on Lemma 3, which gives an error bound for 2sin(^r^), and Lemma 2, which 
gives an error bound for dj. Note that we require the bound e < 0.01 on the level of contamination 
in Lemma 3, but the requirement could again be relaxed with a more refined proof technique. The 
proof of Lemma 3 is contained in Appendix A.3. 

Lemma 3. Under model (1), let e = maxi<j<pej < 0.01. Suppose C > Sir and the sample size 
satisfies n > maxi 15, Then 


max 





+ 5l7re, 


with probability at least 1 


2p 


-^- 2 ) 

I 327r2 J 


(30) 
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Using a similar decomposition as in the proof of Theorem 1, we have 


2aiaj sin ( j - 


+ {\^i - (yi\ + cri)(yj 


2 sin(^rg) - 




“h (T-j; I (Jj (J j 


TT 


2 sin -r^ - - p 


'ij 


+ {\aj - (Tj\+ crj)\ai - fjj|. 


Using Lemmas 2 and 3, we then obtain the overall upper bound 

\ 2 


f + cy‘^ + 7,2M,d +M, cy‘^ + 7,2M,. 


+ M, + Cy!^^ + 7.2M,d M, + + cy!^+7.2M,. 


< (M^(M^ + 1) + 1) ^ + SlTrej + (2M^ + 1) + 7.2M^ej , 

which is easily simplified to obtain the prescribed bound. 


5.3 Proof of Theorem 3 

Clearly, it suffices to prove the elementwise deviation bound for the unsymmetrized matrix ft. 
We begin with the following general lemma, relating deviation bounds in the covariance matrix 
estimator S to the error of the CLIME estimator. A version of the following result appears in Cai 
et al. (2011), but we include the relatively short proof for the sake of completeness. 

Lemma 4. Suppose ft* G hl{q, so{p), M). If ft is the output of the CLIME estimator (10), where 
the regularization parameter satisfies A > M||l] — S*||ooj then ||r2 — i2*||oo < 4||r2*||i^A. 


Proof. We have 


|I - SfiYIoo = 11(51 - 5]*)fi1|oo < \\ft 




< A, 


(31) 


the first inequality is due to ||AB||oo < ||A||oo||B||i^, and the second inequality follows by assump¬ 
tion. Then 

\\±in - n*)||oo < ||sn - I||oo + III - tft*\\^ < 2A. 

For 1 < i < p, let ei be the canonical vector with 1 in the coordinate and 0 in all other 
coordinates, and let (3^ be the solution of the following convex optimization problem: 


min ||/3||i subiect to 
peRp 


1^/^ ^zlloo ^ A. 


Note that ft = {jSi,..., /3p) (cf. Lemma 1 in Cai et al. (2011)). It follows that ||/3J|i < ||ri*||i^, 
for 1 < i < p, so ||n||i^ < ||i2*||2,j. Hence, 

\\'E*{{t - n*)||oo < \\t{ft - n*)||oo + ||(S - ^*){ft - nYlloo 
<2 A+||n-f2 *||Lill^-S*||oo 
<2A+||n||Lj|5]-S* 

< 4A. 


+ ||nYlLil|51-51*1100 
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Finally, 


\n - n*||oo = - fi*)||oo < ||ni|Lj|s*(n - n*)||oo < 4||n*|U,A. 


□ 

Combining Lemma 4 with the result of Theorem 1, we obtain the desired result. 

5.4 Proof of Theorem 4 

Our proof is based on the following result: 

Lemma 5 (Theorem 1 in Ravikumar et al. (2011)). Suppose Li* satisfies the incoherence condi¬ 
tion (18), and that for all 1 < i,j < p, the tail condition 


w>o, 


(32) 


holds, for some function f that is monotonically increasing in n. Also suppose the sample size 
satisfies 

1 


n > Uf 


6(1 + 8/a)/cmax{«;s*Kr*, ’ 


P 


where 

hf{5;r) = argmaxjn : f{n,6) < r}, and 5f{n;r) := argmax{(5 : f{n,S) < r}. 

Then with probability at least 1— for the choice A = ^5f{n,p'^), the GLasso estimator satisfies 

ll^ - ^*\\oo < 2Kr* ^1 + ^f{n,p'^), 

and 

supp(fi) C supp(fJ*). 

Inspecting the proofs of the technical lemmas employed in proving Theorem 1, we may see that 
inequality (32) holds with the function f{n,5) = ciexp(c2n(5 — cqc)^), defined for 6 > cqc, where 
Co,Cl, and C 2 are appropriately chosen constants. An easy calculation shows that 


j,(n,r) = c„e+J^log()-), 


SO 


6f{n,p'^) = Coe + Ci 
Similarly, we may easily verify that 


r logp 


n 


n 


f{S,pl = C2 


rlogp 
((5 - coe)2‘ 


Lemma 5 then implies that the desired conclusions. 
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5.5 Proof of Theorem 5 

Note that S is the projection of the robust covariance estimator 5] onto the positive semidehnite 
cone, where the distance is measured in the elementwise £oo-iiorm. Furthermore, note that 

11^ ~ S||oo < ||S* — S||oo, 


since 'S* ^ 0. Hence, 

HE - S*||oo < ||s - S||oo + ||s - E*||oo < 2||s - E*||oo. (33) 

This implies that the bound (32) in Lemma 5 holds with S replaced by S, and /(n, 6) replaced by 
/(n, 5/2). Proceeding as in the proof of Theorem 4 with these minor modifications, we arrive at 
the bound (27). 

Turning to the derivation of the breakdown point, note that by Theorem 1 of Oellerer and 
Croux (2014), we have 

e„(n(X),X)>e+(S(X),X). (34) 

We first show that 

e+(S(X),X) >50%. (35) 

Consider the estimator E(X”*), based on corrupted data. We have 

||E(X™) - S*||oo < 2||E(X”^) - S*||oo 

<2||E(X”*)||oo + 2||E*||oo, (36) 

where the first inequality follows from the bound (33), and the second inequality comes from the 
triangle inequality. Furthermore, note that since S(X"*) > 0 by construction, we have 

Ai(E(X”^)) = ||E(X”*)||2 < ||X(X™) - E*||2 + ||E*||2 < p||S(X™) - E*||oo + ||X*||2, (37) 

where we have used the bound 

||A||oo < ||A||2 <p||A||oo, 

in the last inequality. Combining inequalities (36) and (37), we then obtain 

Ai(S(X"^)) < 2p||E(X™)||oo + 2p||S*||oo + ||S*||2, 


|Ai(S(X”^)) - Ai(S(X))| < Ai(S(X)) + (2p||E(X™)||oo + 2p||S*||oo + \\'S*h) . (38) 

Finally, since the correlation estimators are bounded in magnitude by 1, we have 

||S(X”*)||oo < max <T,(X™)d,(X"^), (39) 

where {(Tj(X”*)}^<^<p are the robust scale estimators based on X”^, given by the MAD estimators 
calculated from the corresponding columns. Furthermore, the breakdown point of the MAD is 
50% (Huber, 1981), meaning the quantity on the right-hand side of inequality (39) is finite when 
^ < 50%. Then by inequality (38) and the definition of the explosion breakdown point, we conclude 
that the bound (35) holds. By inequality (34), we therefore have en(Li(X),X) > 50%, as well. 

We now establish that en(fi(X),X) = 50%. Note that if we are allowed to corrupt more than 
50% of the entries in each column of the data matrix, the columnwise MAD estimates may be made 
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arbitrarily small (say, smaller than some value a); indeed, we may simply replace more than half 
of the entries in each column by values in (0, a). Consequently, the overall covariance estimator 
will have all entries bounded in magnitude by [<l>“^(0.75)]“^a^. We claim that the diagonal 
elements of S(X™') must therefore be bounded in magnitude by 2[‘h“^(0.75)]“^a^. Indeed, note 
that the matrix diag(5](X’")) is feasible for the projection (25). Hence, we must have 

||S(X™) - X(X’")||oo < ||X(X"^) - diag(S(X”^))||oo < [$-n0.75)]-V, 
implying in particular that 

||diag(S(X-))||oo < ||diag(X(X-))||oo + ||diag(S(X™)) - diag(S(X-))|U < 2[ci>-i(0.75)]-2a2, 


as claimed. Now note that the first-order optimality condition for the GLasso is given by 
X(X’") - (n(X”*))"^ + A • sign{n(X”*) - diag(n(X”^))} = 0, 


where the sign function is computed entrywise, omitting the diagonal elements of i7(X™'). In 
particular, this implies that the diag(S(X™')) = diag | (fI(X™')) so the diagonal elements of 

(fi(X™')) ^ are also bounded in magnitude by 2[<I>“^(0.75)]“^a^. Hence, 


Xp 




< 

< 


min v'^f(n(X”^)) Mv 
||v ||2 = l 7 

imm ej ((^(X-))-') e,- 

|diag{(n(X™))-'}||^ 


< 2[^>■l(0.75)]■2a^ 


where the e^-’s are the canonical basis vectors, and we have used the variational representation 
of eigenvalues of a Hermitian matrix to show that the minimum eigenvalue is bounded by the 
minimum diagonal entry. This allows us to conclude that 

1 = Ap ■ (n(X™))“^) < Ai (fJ(X™)) -Ap ((n(X™))"^) < Ai (fJ(X”^)) •2[$-^(0.75)]-V, 


where we have used the inequality Xp{AB) < Ai(A)Ap(B), for A,S ^ 0, in the first inequality 
(Zhang, 2011). Hence, 


Ai (fi(X'")) > 


[$-1(0.75)]^ 


However, we may choose a to be arbitrarily close to 0, implying that the maximum eigenvalue of 
$1(X™') may be made arbitrarily large, and the estimator breaks down. This concludes the proof. 


5.6 Proof of Theorem 6 

Clearly, (n,x) > X for any X, by the definition of the breakdown point. To show equality, we 
now provide a data matrix X and a corrupted data matrix X^, where X^ differs from X in at most 
one element per column, and the CLIME problem is feasible for S(X) but infeasible for X(X^). 
Consider the n x 2 matrix X^, constructed as follows: 


xi = 


/ ai -ai \ 
02 —02 

\ 0‘n Oji J 
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where the a^s are all distinct. Note that the columns of are perfectly negatively correlated; 
hence, the correlation matrix (computed from either Kendall’s tau or Spearman’s rho, for instance) 
is 



Furthermore, we have a\ = (T 2 := ft, since the data in the two columns are negatives of each other. 
It follows that 


S(X^) = <7^ 


1 -1 

-1 1 


Clearly, the problem 


/3i: 


X(Xi)/3i 


1 

0 



< A 


is infeasible for A < ^. Hence, the CLIME estimator based on S(X^) is infeasible. 

On the other hand, we may construct an initial data matrix X such that the CLIME program 
based on S(X) is feasible, simply by altering the last row of X^. Suppose we change the last row 
of X^ to {an,an)- Then the columns are no longer perfectly negatively correlated, and it is easy to 
check that the correlation matrix of X will take the form 


1 

a 


a 

1 


for some |a| < I. Denoting the corresponding estimates of scale as di and <72, we then have 


S(X) 


df adid2 

fldid2 d'2 


Note that det{5](X)} = d^d|(l — a^) > 0. It follows that S(X) is invertible. In particular, the 
matrix ^S(X)^ is always a feasible point for the CLIME program based on X(X). 

Hence, we conclude that the CLIME program breaks down when even one corruption per column 
is allowed. It follows that (n,x) = A for the constructed value of X. 


6 Simulations 

In this section, we perform simulation studies to examine the performance of the two robust co- 
variance matrix estimators introduced in Section 2, and also the robust precision matrix estimators 
obtained using the GLasso. We will refer to the two type of estimators as Kendall and Spearman, 
respectively. 

For comparison, we also compute the following robust covariance matrix estimators, which are 
similarly plugged into the GLasso to obtain robust precision matrix estimators; 

• SpearmanU: The pairwise covariance matrix estimator proposed in Oellerer and Croux (2014), 
where the MAD estimator is combined with Spearman’s rho (without transformation): 

'Sij = ai&jrfj, where dj = [4>“^(0.75)]“^(lj. 

• OGK: The OGK estimator proposed in Maronna and Zamar (2002), with scale estimator Qn- 
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NPD: The pairwise covariance matrix estimator considered in Tarr et al. (2015), where 


is the Qn statistic computed from {X/^i + : 1 < k < n}, and _ is the Qn 

statistic computed from {X^i — X^j ■. 1 < k < n}. An NPD projection is applied to 5] to 
obtain the final positive semidefinite covariance matrix estimator: 

S = min 11^ — Mlli?. 

M^o 

Further details for the orthogonalized Gnanedesikan-Kettenring (OGK) and nearest positive defi¬ 
nite (NPD) procedures may be found in Maronna and Zamar (2002) and Higham (2002), respec¬ 
tively. The nonrobust GLasso, which takes the sample covariance matrix estimator as an input 
(SampleCov), as well as the inverse sample covariance matrix estimator (invCov), applicable in the 
case p < n, are used as points of reference. 

An implementation of the GLasso that allows the diagonal entries of the precision matrix 
estimator to be unpenalized is provided in the widely used glasso package. In this paper, however, 
we use the GLasso implementation from the QUIC package (Hsieh et ah, 2011), since it does not 
require the input covariance matrix to be positive semidefinite, and speeds up substantially over 
glasso. We select the tuning parameter A in GLasso by cross-validation: We first split the data 
into K groups, or folds, of nearly equal size. For a given A and 1 < A; < iF, we take the k^^ fold as 
the test set, and compute the precision matrix estimate based on the remaining K — 1 folds. 

We then compute the negative log-likelihood on the test set: 

= -logdetnj^”^^ +tr , 

- (k) 

where X) is the robust covariance estimate obtained from the test set. This is done over a 
logarithmically spaced grid of 15 values between Amax = maxj^j and Amin = O.OlAmax, where 
5] is the robust covariance estimate computed from the whole data set. The value of A that 
minimizes 

k=l 

is selected as the final tuning parameter. 

Simulation settings: We consider the following four sampling schemes, covering different struc¬ 
tures of the true precision matrix ft* G The first three structures come from Gai et al. 

(2011). 

• Banded: fit- = 

•'J 

• Sparse: Li* = B -|- hip, where bu = 0 and bij = bji, with P{bij = 0.5) = 0.1 and P{bij = 0) = 
0.9, for i j. The parameter 5 is chosen such that the condition number of Cl* equals p. The 
matrix is then standardized to have unit diagonals. 

• Dense: = 1 and fl*j = 0.5, for i j. 

• Diagonal: fl* = Ip. 
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For each sampling scheme and dimension p G {120,400}, we generate B = 100 samples of 
size n = 200 from the multivariate normal distribution N{0, We then add 5% or 10% 

of rowwise or cellwise contamination to the data, where the outliers are sampled independently 
from iV(10,0.2). We also simulate model deviation by generating all observations from either 
the multivariate t-distribution, ^ 3 ( 0 , or the alternative t-distribution, ^ 3 ( 0 , ( 11 *)“^), each 

with three degrees of freedom. Recall that X ~ where tj,(0, denotes the 

multivariate t-distribution with u degrees of freedom, if 

X = Y/^, 

where Y ~ Y(0, ($1*)“^) and r ~ V{yl‘l). The alternative t-distribution, denoted by t*, is 
proposed in Finegold and Drton (2011) as a generalization of the multivariate t-distribution. We 
say that X ~ t*{0, (0*)“^) if 

Yi = VI < i < p, 

where the p divisors Ti ~ F(i//2, vjY) are independent. In this case, the heaviness of the tails are 
different for different components of X. 

Performance measures: We assess the performance of the covariance and precision matrix 
estimators via the deviations ||X — X*||oo and ||r2 — ri*||oo, respectively. To measure the accuracy 
of recovering the support of the true precision matrix, we also consider the false positive (FP) and 
false negative (FN) rates: 

pp _ l{(bi) • 7 ^ 0 ) = 0)1 , p^j _ |{(bj) • ^ij = 0 , rijj 7 ^ 0 }| 

FP gives the proportion of zero elements in the true precision matrix that are incorrectly estimated 
to be nonzero, while FN gives the proportion of nonzero elements in the true precision matrix that 
are incorrectly estimated to be zero. Note that if has no zero entries, as in the case of the 
banded and dense structures, the quantity FP is undefined. 

Tables 1 and 2 show the results for n = 200 and p = 120. We summarize the salient points 
below: 

• When the dataset is clean, SampleCov performs best in terms of both covariance and pre¬ 
cision matrix estimation, across all sampling schemes. Note that even though the data are 
uncontaminated, InvCov performs poorly, due to the fact that the sample covariance matrix 
has low precision when p > n/ 2 . 

• In the case of rowwise contamination, the nonrobust SampleCov has the largest estimation 
error for the covariance matrix, as expected. Curiously, the precision matrix estimation error 
based on SampleCov is the lowest among all estimators. We do not have good explanation 
for this, but the tuning parameter selected for SampleCov by cross-validation tends to be 
smaller (as can be seen from its relatively low FN). NPD, Kendall, Spearman, and SpearmanU 
have similar performance in terms of both covariance and precision matrix estimation. In all 
sampling schemes, OGK outperforms these four estimators for covariance estimation, but not 
consistently so for precision matrix estimation. 

• For covariance and precision matrix estimation under cellwise contamination, the Kendall, 
Spearman, and SpearmanU estimators perform the best. NPD performs the worst among all 
cellwise robust covariance matrix estimators. Nonetheless, NPD still beats OGK, which is de¬ 
signed to work well under rowwise contamination, and also beats the nonrobust SampleCov. 
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• When the data are generated from the multivariate t-distribution or alternative t-distribution, 
we again see that Kendall, Spearman, and SpearmanU behave similarly and outperform all 
other estimators, across all sampling schemes. 

• When fl* is either sparse or diagonal, FP is low for all estimators except InvCov, under all 
contamination mechanisms. 

• Except for InvCov, FN is high when ft* is banded or dense, under all contamination mecha¬ 
nisms. This is expected because GLasso implicitly assumes the underlying to be sparse, 
which is not true in these cases. When ft* is sparse, the FN for Kendall, Spearman, and 
SpearmanU are relatively low compared to the other estimators. 

Tables 3 and 4 show the results for n = 200 and p = 400. Since p > n, the inverse sample covariance 
matrix cannot be computed, hence is excluded from the analysis. Overall, we obtain conclusions 
similar to those obtained in the first set of simulations: 

• When the data are clean, SampleCov perform best in terms of estimation error, across all 
sampling schemes. Immediately following are OGK and NPD, and then Kendall, Spearman, 
and SpearmanU (the last three have nearly the same performance). 

• Under rowwise contamination, SampleCov has the worst covariance estimation error, but also 
the best precision estimation error, across all sampling schemes. OGK performs best in terms of 
covariance estimation, but not precision estimation. NPD, Kendall, Spearman, and SpearmanU 
have similar performance in nearly all cases. When fl* is diagonal and the contamination 
fraction is 10%, Kendall turns out to have high precision estimation error, possibly because 
the selected tuning parameter in GLasso is too small (as can be seen by the high FP). 

• In terms of estimation error under cellwise contamination, OGK performs nearly as badly as 
SampleCov. Kendall, Spearman, and SpearmanU perform equally well, while NPD is slightly 
worse off. 

• When the data are generated from the multivariate t-distribution or alternative t-distribution, 
SampleCov performs badly. Kendall, Spearman, and SpearmanU perform similarly and out¬ 
perform OGK and NPD, across all sampling schemes. 

• In general, under all contamination mechanisms, when ft* is either sparse or diagonal, FP 
is low for all estimators. On the other hand, when ft* is banded or dense, FN is high, as 
expected. When ft* is sparse, FN is not as low as desired. 

In summary, SampleCov performs best for clean data. Under rowwise contamination, OGK 
yields the best results in terms of covariance estimation. Under cellwise contamination, Kendall, 
Spearman, and SpearmanU equally share the best performance, while NPD is slightly worse off. 
Kendall, Spearman, and SpearmanU also perform very well when the data are generated from 
a multivariate t-distribution or the alternative t-distribution, although these latter cases are not 
covered by our theory. 
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clean 

5% rowwise 

10% rowwise 



Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 


SampleCov 

1.11 

0.30 


0.85 

5.91 

0.31 


0.60 

10.44 

0.31 


0.61 


DGK 

1.20 

0.32 


0.88 

1.98 

0.37 


0.90 

2.91 

0.41 


0.91 


NPD 

1.26 

0.35 


0.96 

2.24 

0.37 


0.72 

3.39 

0.39 


0.71 

Banded 

Kendall 

1.73 

0.33 


0.87 

2.50 

0.32 


0.63 

3.37 

0.31 


0.63 


Spearman 

1.73 

0.33 


0.87 

2.50 

0.33 


0.64 

3.37 

0.33 


0.64 


SpearmanU 

1.73 

0.34 


0.88 

2.50 

0.34 


0.64 

3.37 

0.34 


0.63 


InvCov 

1.11 

1.68 


0.00 

5.91 

1.83 


0.00 

10.44 

2.09 


0.00 


SampleCov 

0.70 

0.34 

0.19 

0.11 

5.57 

0.35 

0.36 

0.30 

10.09 

0.32 

0.36 

0.32 


DGK 

0.79 

0.39 

0.18 

0.15 

1.62 

0.51 

0.18 

0.20 

2.39 

0.59 

0.17 

0.24 


NPD 

0.82 

0.47 

0.09 

0.32 

1.63 

0.55 

0.21 

0.66 

2.58 

0.61 

0.20 

0.76 

Sparse 

Kendall 

1.15 

0.43 

0.17 

0.16 

1.63 

0.41 

0.32 

0.37 

2.36 

0.40 

0.32 

0.41 


Spearman 

1.15 

0.43 

0.17 

0.16 

1.64 

0.43 

0.32 

0.37 

2.38 

0.43 

0.31 

0.42 


SpearmanU 

1.15 

0.45 

0.17 

0.15 

1.65 

0.45 

0.33 

0.36 

2.37 

0.46 

0.31 

0.41 


InvCov 

0.70 

2.83 

1.00 

0.00 

5.57 

3.14 

1.00 

0.00 

10.09 

3.54 

1.00 

0.00 


SampleCov 

0.60 

0.60 


0.99 

5.54 

0.61 


0.75 

10.05 

0.60 


0.75 


□GK 

0.63 

0.61 


0.99 

1.18 

0.68 


0.99 

1.88 

0.74 


0.99 


NPD 

0.67 

0.62 


0.99 

1.23 

0.65 


0.82 

1.89 

0.69 


0.79 

Dense 

Kendall 

1.00 

0.66 


0.99 

1.37 

0.64 


0.79 

1.91 

0.64 


0.78 


Spearman 

1.00 

0.66 


0.99 

1.37 

0.64 


0.79 

1.91 

0.64 


0.77 


SpearmanU 

0.99 

0.66 


0.99 

1.37 

0.64 


0.78 

1.91 

0.65 


0.77 


InvCov 

0.60 

2.63 


0.00 

5.54 

1.28 


0.00 

10.05 

1.48 


0.00 


SampleCov 

0.30 

0.31 

0.00 

0.00 

5.31 

0.26 

0.24 

0.00 

9.84 

0.28 

0.24 

0.00 


DGK 

0.32 

0.33 

0.00 

0.00 

0.55 

0.35 

0.00 

0.00 

0.80 

0.44 

0.00 

0.00 


NPD 

0.33 

0.35 

0.00 

0.00 

0.63 

0.31 

0.18 

0.00 

0.98 

0.39 

0.21 

0.00 

Diagonal 

Kendall 

0.51 

0.62 

0.00 

0.00 

0.68 

0.51 

0.20 

0.00 

0.96 

0.46 

0.21 

0.00 


Spearman 

0.51 

0.62 

0.00 

0.00 

0.68 

0.52 

0.21 

0.00 

0.96 

0.47 

0.22 

0.00 


SpearmanU 

0.51 

0.62 

0.00 

0.00 

0.68 

0.52 

0.21 

0.00 

0.96 

0.45 

0.23 

0.00 


InvCov 

0.30 

2.81 

1.00 

0.00 

5.31 

3.19 

1.00 

0.00 

9.84 

3.60 

1.00 

0.00 


Table 1; Simulation results for seven estimators and four sampling schemes, when n = 200 and p = 120. Performance is measured by 
||S — X)*||oo for covariance matrix estimation (Cov), ||r2 — ri*||oo for precision matrix estimation (Prec), and false positive rate (FP) and 
false negative rate (FN) for support recovery of the true precision matrix. The results are averaged over 100 replications. 






5% cellwise 

10% cellwise 

multivariate t 

alternative t 



Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 


SampleCov 

8.33 

0.51 


0.97 

13.09 

0.54 


0.99 

18.31 

0.49 


0.87 

57.72 

0.57 


0.93 


OGK 

8.10 

0.51 


0.95 

13.15 

0.54 


0.99 

3.85 

0.43 


0.92 

12.15 

0.53 


0.92 


NPD 

2.78 

0.41 


0.95 

4.70 

0.46 


0.96 

4.06 

0.44 


0.96 

4.53 

0.46 


0.96 

Banded 

Kendall 

2.43 

0.40 


0.92 

3.67 

0.45 


0.92 

3.32 

0.41 


0.90 

3.60 

0.42 


0.90 


Spearman 

2.43 

0.41 


0.92 

3.67 

0.45 


0.92 

3.32 

0.41 


0.91 

3.60 

0.42 


0.90 


SpearmanU 

2.43 

0.41 


0.93 

3.67 

0.45 


0.93 

3.32 

0.42 


0.91 

3.60 

0.43 


0.90 


InvCov 

8.33 

0.41 


0.00 

13.09 

0.46 


0.00 

18.31 

1.26 


0.00 

57.72 

0.53 


0.00 


SampleCov 

8.39 

0.90 

0.05 

0.81 

13.25 

0.93 

0.01 

0.91 

11.47 

0.77 

0.14 

0.43 

32.95 

0.94 

0.12 

0.44 


OGK 

8.18 

0.90 

0.06 

0.77 

13.71 

0.94 

0.01 

0.90 

3.38 

0.65 

0.16 

0.23 

8.67 

0.86 

0.16 

0.34 


NPD 

2.15 

0.61 

0.06 

0.45 

4.04 

0.73 

0.05 

0.59 

3.17 

0.69 

0.08 

0.45 

3.31 

0.71 

0.07 

0.49 

Sparse 

Kendall 

1.58 

0.61 

0.16 

0.30 

2.44 

0.72 

0.13 

0.46 

2.34 

0.58 

0.15 

0.25 

2.32 

0.62 

0.16 

0.22 


Spearman 

1.58 

0.62 

0.15 

0.30 

2.44 

0.73 

0.13 

0.46 

2.34 

0.59 

0.15 

0.25 

2.32 

0.62 

0.15 

0.23 


SpearmanU 

1.58 

0.63 

0.16 

0.30 

2.44 

0.73 

0.13 

0.46 

2.34 

0.60 

0.15 

0.25 

2.32 

0.63 

0.16 

0.22 


InvCov 

8.39 

0.77 

1.00 

0.00 

13.25 

0.85 

1.00 

0.00 

11.47 

2.10 

1.00 

0.00 

32.95 

0.87 

1.00 

0.00 


SampleCov 

8.39 

0.90 


0.99 

13.25 

0.93 


0.99 

10.06 

0.88 


0.98 

31.24 

0.95 


0.99 


OGK 

8.02 

0.90 


0.99 

13.14 

0.93 


0.99 

2.14 

0.76 


0.99 

6.82 

0.89 


0.99 


NPD 

1.51 

0.71 


0.99 

2.64 

0.78 


0.99 

2.21 

0.76 


0.99 

2.50 

0.78 


0.99 

Dense 

Kendall 

1.36 

0.70 


0.99 

2.00 

0.75 


0.99 

1.84 

0.74 


0.99 

2.08 

0.75 


0.99 


Spearman 

1.36 

0.70 


0.99 

2.00 

0.75 


0.99 

1.84 

0.74 


0.99 

2.08 

0.75 


0.99 


SpearmanU 

1.36 

0.70 


0.99 

2.00 

0.75 


0.99 

1.84 

0.74 


0.99 

2.08 

0.75 


0.99 


InvCov 

8.39 

0.78 


0.00 

13.25 

0.85 


0.00 

10.06 

1.88 


0.00 

31.24 

0.88 


0.00 


SampleCov 

8.44 

0.89 

0.00 

0.00 

13.37 

0.93 

0.00 

0.00 

5.07 

0.77 

0.01 

0.00 

15.41 

0.90 

0.00 

0.00 


OGK 

7.89 

0.89 

0.00 

0.00 

13.15 

0.93 

0.00 

0.00 

1.07 

0.51 

0.00 

0.00 

3.44 

0.77 

0.00 

0.00 


NPD 

0.76 

0.43 

0.00 

0.00 

1.37 

0.58 

0.00 

0.00 

1.11 

0.52 

0.00 

0.00 

1.25 

0.55 

0.00 

0.00 

Diagonal 

Kendall 

0.70 

0.44 

0.00 

0.00 

1.00 

0.50 

0.00 

0.00 

0.93 

0.48 

0.00 

0.00 

1.02 

0.50 

0.00 

0.00 


Spearman 

0.70 

0.44 

0.00 

0.00 

1.00 

0.50 

0.00 

0.00 

0.93 

0.48 

0.00 

0.00 

1.02 

0.50 

0.00 

0.00 


SpearmanU 

0.70 

0.44 

0.00 

0.00 

1.00 

0.50 

0.00 

0.00 

0.93 

0.48 

0.00 

0.00 

1.02 

0.50 

0.00 

0.00 


InvCov 

8.44 

0.76 

1.00 

0.00 

13.37 

0.85 

1.00 

0.00 

5.07 

2.12 

1.00 

0.00 

15.41 

0.92 

1.00 

0.00 


Table 2; Simulation results for seven estimators and four sampling schemes, when n = 200 and p = 120. Performance is measured by 
||S — X)*||oo for covariance matrix estimation (Cov), ||r2 — ri*||oo for precision matrix estimation (Prec), and false positive rate (FP) and 
false negative rate (FN) for support recovery of the true precision matrix. The results are averaged over 100 replications. 






clean 

5% rowwise 

10% rowwise 



Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 


SampleCov 

1.24 

0.33 


0.96 

5.98 

0.34 


0.85 

10.34 

0.35 


0.86 


DGK 

1.38 

0.34 


0.96 

2.20 

0.38 


0.95 

3.10 

0.41 


0.95 

Banded 

NPD 

1.64 

0.38 


0.99 

2.75 

0.40 


0.89 

3.95 

0.42 


0.89 

Kendall 

2.07 

0.37 


0.97 

2.76 

0.34 


0.85 

3.73 

0.35 


0.86 


Spearman 

2.07 

0.37 


0.97 

2.76 

0.35 


0.86 

3.73 

0.35 


0.86 


SpearmanU 

2.07 

0.37 


0.97 

2.76 

0.35 


0.86 

3.73 

0.35 


0.86 


SampleCov 

0.81 

0.44 

0.09 

0.56 

5.61 

0.43 

0.14 

0.73 

9.93 

0.40 

0.14 

0.74 


DGK 

0.96 

0.45 

0.09 

0.59 

1.86 

0.53 

0.09 

0.62 

2.87 

0.61 

0.10 

0.62 

Sparse 

NPD 

1.11 

0.59 

0.03 

0.79 

2.14 

0.63 

0.08 

0.93 

3.61 

0.68 

0.08 

0.95 

Kendall 

1.35 

0.50 

0.09 

0.60 

1.76 

0.48 

0.12 

0.77 

2.71 

0.47 

0.12 

0.79 


Spearman 

1.35 

0.50 

0.08 

0.60 

1.77 

0.49 

0.12 

0.77 

2.72 

0.49 

0.12 

0.79 


SpearmanU 

1.35 

0.51 

0.09 

0.60 

1.78 

0.51 

0.13 

0.77 

2.72 

0.51 

0.12 

0.79 


SampleCov 

0.69 

0.62 


1.00 

5.53 

0.62 


0.91 

9.90 

0.60 


0.91 


□GK 

0.78 

0.64 


1.00 

1.29 

0.69 


1.00 

1.92 

0.74 


1.00 

Dense 

NPD 

0.89 

0.65 


1.00 

1.54 

0.68 


0.93 

2.24 

0.72 


0.91 

Kendall 

1.17 

0.68 


1.00 

1.54 

0.65 


0.92 

2.12 

0.70 


0.91 


Spearman 

1.17 

0.68 


1.00 

1.54 

0.65 


0.92 

2.12 

0.65 


0.91 


SpearmanU 

1.17 

0.68 


1.00 

1.54 

0.66 


0.92 

2.12 

0.65 


0.91 


SampleCov 

0.34 

0.37 

0.00 

0.00 

5.28 

0.26 

0.09 

0.00 

9.64 

0.32 

0.09 

0.00 


DGK 

0.38 

0.38 

0.00 

0.00 

0.58 

0.36 

0.00 

0.00 

0.78 

0.44 

0.00 

0.00 

Diagonal 

NPD 

0.45 

0.32 

0.00 

0.00 

0.78 

0.37 

0.07 

0.00 

1.15 

0.44 

0.09 

0.00 

Kendall 

0.59 

0.72 

0.00 

0.00 

0.78 

0.60 

0.08 

0.00 

1.07 

4.83 

0.33 

0.00 


Spearman 

0.59 

0.72 

0.00 

0.00 

0.78 

0.60 

0.08 

0.00 

1.07 

0.57 

0.08 

0.00 


SpearmanU 

0.59 

0.72 

0.00 

0.00 

0.78 

0.59 

0.08 

0.00 

1.07 

0.56 

0.09 

0.00 


Table 3: Simulation results for six estimators and four sampling schemes, when n = 200 and p = 400. Performance is measured by 
||S — X)*||oo for covariance matrix estimation (Cov), ||r2 — ri*||oo for precision matrix estimation (Prec), and false positive rate (FP) and 
false negative rate (FN) for support recovery of the true precision matrix. The results are averaged over 100 replications. 






5% cellwise 

10% cellwise 

multivariate t 

alternative t 



Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 

Cov 

Prec 

FP 

FN 


SampleCov 

8.90 

0.48 


0.69 

13.70 

0.46 


0.44 

22.41 

0.45 


0.87 

137.82 

0.57 


0.86 


OGK 

8.79 

0.48 


0.66 

13.89 

0.46 


0.39 

3.97 

0.44 


0.95 

18.41 

0.51 


0.53 

Banded 

NPD 

4.04 

0.45 


0.98 

7.03 

0.45 


0.78 

5.03 

0.46 


0.94 

5.83 

0.48 


0.97 

Kendall 

2.89 

0.42 


0.96 

4.11 

0.46 


0.98 

3.69 

0.42 


0.96 

3.99 

0.43 


0.97 


Spearman 

2.89 

0.42 


0.96 

4.11 

0.46 


0.98 

3.69 

0.42 


0.96 

3.99 

0.43 


0.97 


SpearmanU 

2.89 

0.42 


0.96 

4.11 

0.46 


0.97 

3.69 

0.42 


0.96 

3.99 

0.44 


0.97 


SampleCov 

8.98 

0.91 

0.01 

0.96 

13.82 

0.85 

0.52 

0.45 

13.53 

0.79 

0.05 

0.80 

79.44 

0.96 

0.04 

0.85 


OGK 

8.83 

0.91 

0.02 

0.94 

14.48 

0.88 

0.57 

0.40 

3.83 

0.66 

0.10 

0.62 

12.48 

0.90 

0.07 

0.77 

Sparse 

NPD 

3.10 

0.72 

0.03 

0.83 

6.15 

0.82 

0.02 

0.87 

4.40 

0.76 

0.03 

0.84 

4.67 

0.78 

0.03 

0.86 

Kendall 

1.80 

0.64 

0.06 

0.74 

2.94 

0.74 

0.05 

0.82 

2.61 

0.63 

0.07 

0.69 

2.71 

0.66 

0.07 

0.67 


Spearman 

1.80 

0.65 

0.06 

0.74 

2.94 

0.74 

0.05 

0.82 

2.61 

0.64 

0.07 

0.69 

2.71 

0.66 

0.07 

0.67 


SpearmanU 

1.80 

0.65 

0.07 

0.73 

2.94 

0.75 

0.05 

0.82 

2.61 

0.64 

0.07 

0.68 

2.71 

0.66 

0.07 

0.67 


SampleCov 

8.96 

0.90 


0.96 

13.81 

0.85 


0.46 

12.64 

0.88 


0.99 

79.01 

0.98 


1.00 


OGK 

8.62 

0.90 


0.93 

13.64 

0.85 


0.38 

2.24 

0.76 


1.00 

10.33 

0.92 


1.00 

Dense 

NPD 

2.35 

0.77 


1.00 

4.28 

0.84 


1.00 

2.82 

0.79 


1.00 

3.22 

0.81 


1.00 

Kendall 

1.64 

0.72 


1.00 

2.29 

0.77 


1.00 

2.12 

0.75 


1.00 

2.25 

0.76 


1.00 


Spearman 

1.64 

0.72 


1.00 

2.29 

0.77 


1.00 

2.12 

0.75 


1.00 

2.25 

0.76 


1.00 


SpearmanU 

1.64 

0.72 


1.00 

2.29 

0.77 


1.00 

2.12 

0.75 


1.00 

2.25 

0.76 


1.00 


SampleCov 

9.03 

0.90 

0.00 

0.00 

13.93 

0.87 

0.47 

0.00 

6.33 

0.77 

0.01 

0.00 

39.73 

0.95 

0.00 

0.00 


OGK 

8.60 

0.90 

0.00 

0.00 

13.74 

0.87 

0.54 

0.00 

1.11 

0.52 

0.00 

0.00 

5.17 

0.84 

0.00 

0.00 

Diagonal 

NPD 

1.20 

0.54 

0.00 

0.00 

2.19 

0.69 

0.00 

0.00 

1.42 

0.58 

0.00 

0.00 

1.62 

0.62 

0.00 

0.00 

Kendall 

0.81 

0.52 

0.00 

0.00 

1.15 

0.54 

0.00 

0.00 

1.06 

0.52 

0.00 

0.00 

1.14 

0.54 

0.00 

0.00 


Spearman 

0.81 

0.52 

0.00 

0.00 

1.15 

0.54 

0.00 

0.00 

1.06 

0.52 

0.00 

0.00 

1.14 

0.54 

0.00 

0.00 


SpearmanU 

0.81 

0.52 

0.00 

0.00 

1.15 

0.54 

0.00 

0.00 

1.06 

0.52 

0.00 

0.00 

1.14 

0.54 

0.00 

0.00 


Table 4: Simulation results for six estimators and four sampling schemes, when n = 200 and p = 400. Performance is measured by 
||S — X)*||oo for covariance matrix estimation (Cov), ||r2 — ri*||oo for precision matrix estimation (Prec), and false positive rate (FP) and 
false negative rate (FN) for support recovery of the true precision matrix. The results are averaged over 100 replications. 




7 Discussion 


In this paper, we have derived statistical error bounds for high-dimensional robust precision ma¬ 
trix estimators, when data are drawn from a multivariate normal distribution and then observed 
subject to cellwise contamination. We show that in such settings, the precision matrix estimators 
that are obtained by plugging in pairwise robust covariance estimators to the GLasso or CLIME 
routine, as suggested by Oellerer and Croux (2014) and Tarr et al. (2015), have error bounds that 
match standard high-dimensional bounds for uncontaminated precision matrix estimation, up to 
an additive factor involving a constant multiple of the contamination fraction e. Our results for 
precision matrix estimators are derived via estimation error bounds for robust covariance matrix 
estimators, which have similar deviation properties. 

The results of our paper naturally suggest several venues for future work. As discussed earlier, 
our results seem to indicate that covariance estimators based on bounded-influence estimators of 
correlation and scale give rise to statistical error bounds of the form derived in our paper, and it 
would be interesting to rigorize this notion, as a further attempt to connect the fields of robust 
and high-dimensional statistics. It would also be interesting to relate the nonasymptotic statistical 
error bounds to the behavior of the sensitivity curve of the robust covariance estimator, which is the 
finite-sample analog of the influence function. We have also left open the question of calculating the 
breakdown point for the CLIME estimator with respect to more general data matrices, as well as the 
breakdown behavior of CLIME and GLasso under different notions of breakdown point. Although 
our results imply the superiority of the GLasso over the CLIME estimator from the perspective of 
the finite-sample breakdown point, this may only be part of the story. 

Lastly, it would be interesting to generalize our study to other classes of distributions. In one 
direction, it would be possible to study contaminated versions of other distributions besides the 
multivariate Gaussian, for which the precision matrix encodes information about the underlying 
graphical model (e.g., Ising models on trees). A harder question to tackle would be the problem of 
robust graphical model estimation in settings where the structure of the graph is not encoded in 
the precision matrix alone. Finally, one could consider robust estimation of scatter matrices, when 
the uncontaminated data are drawn from an elliptical distribution. In that case, the proposed 
Kendall’s tau and Spearman’s rho correlation coefficients would still be Fisher consistent upon 
taking the respective sine transformations, so similar error bounds should hold. As demonstrated 
in our simulation results, the pairwise covariance estimators based on Kendall’s tau and Spearman’s 
rho perform reasonably well when data are generated from either the multivariate t-distribution or 
the alternative t-distribution. This motivates studying the convergence rates of the same covariance 
matrix estimators under heavy-tailed or elliptical distributions. 

The problem of estimating high-dimensional covariance matrices under various structural as¬ 
sumptions has also been widely studied. Various families of structured covariance matrices have 
been introduced, including bandable matrices (Cai et ah, 2010), Toeplitz matrices (Cai et ah, 2013), 
and sparse matrices (Bickel and Levina, 2008; Cai and Zhou, 2012). The proposed covariance 
matrix estimators involve regularizing the sample covariance matrix in accordance to structural 
assumptions. It would be interesting to study robust versions of these structured covariance ma¬ 
trix estimators under a model such as cellwise contamination. Besides graphical models, covariance 
matrix estimation is also useful for statistical methods such as linear discriminant analysis and prin¬ 
cipal component analysis. Several high-dimensional procedures have been proposed with proven 
theoretical guarantees when data are uncontaminated (Cai and Liu, 2011; Vu et ah, 2013), and it 
would be interesting to study robust adaptations of these procedures, as well. 


31 


References 

Agostinelli, C., A. Leung, V. J. Yohai, and R. H. Zamar (2014, June). Robust estimation of 
multivariate location and scatter in the presence of cellwise and casewise contamination. arXiv 
e-prints. Available at http://arxiv.org/abs/1406.6031. 

Alqallaf, F., S. Van Aelst, V. J. Yohai, and R. H. Zamar (2009). Propagation of outliers in 
multivariate data. Ann. Statist. 57(1), 311-331. 

Alqallaf, F. A., K. P. Konis, R. D. Martin, and R. H. Zamar (2002). Scalable robust covariance and 
correlation estimates for data mining. In Proceedings of the eighth ACM SIGKDD international 
conference on Knowledge discovery and data mining, pp. 14-23. ACM. 

Anderson, T. (2003). An Introduction to Multivariate Statistical Analysis. Wiley Series in Proba¬ 
bility and Statistics. Wiley. 

Bickel, P. J. (1964). On some alternative estimates for shift in the p-variate one sample problem. 
Ann. Math. Statist. 35, 1079-1090. 

Bickel, P. J. and E. Levina (2008). Covariance regularization by thresholding. Ann. Statist. 55(6), 
2577-2604. 

Cai, T. and W. Liu (2011). A direct estimation approach to sparse linear discriminant analysis. J. 
Amer. Statist. Assoc. 755(496), 1566-1577. 

Cai, T., W. Liu, and X. Luo (2011). A constrained £i minimization approach to sparse precision 
matrix estimation. J. Amer. Statist. Assoc. 755(494), 594-607. 

Cai, T. T., Z. Ren, and H. H. Zhou (2013). Optimal rates of convergence for estimating Toeplitz 
covariance matrices. Probab. Theory Related Fields 755(1-2), 101-143. 

Cai, T. T., C.-H. Zhang, and H. H. Zhou (2010). Optimal rates of convergence for covariance 
matrix estimation. Ann. Statist. 55(4), 2118-2144. 

Cai, T. T. and H. H. Zhou (2012). Optimal rates of convergence for sparse covariance matrix 
estimation. Ann. Statist. 40{5), 2389-2420. 

Chen, M., C. Gao, and Z. Ren (2015). Robust covariance matrix estimation via matrix depth. 
arXiv preprint arXiv:1506.00691. 

Croux, C. and C. Dehon (2010). Influence functions of the Spearman and Kendall correlation 
measures. Statistical Methods & Applications 75(4), 497-515. 

Donoho, D. and P. J. Huber (1983). The notion of breakdown point. In A Festschrift for Erich L. 
Lehmann, Wadsworth Statist./Probab. Ser., pp. 157-184. Wadsworth, Belmont, CA. 

Donoho, D. L. (1982). Breakdown properties of multivariate location estimators. Technical re¬ 
port, Technical report. Harvard University, Boston. URL http://www-stat. Stanford. edu/~ 
donoho/Reports/Oldies/BPMLE. pdf. 

Finegold, M. and M. Drton (2011). Robust graphical modeling of gene networks using classical and 
alternative t-distributions. Ann. Appl. Stat. 5(2A), 1057-1080. 


32 


Friedman, J., T. Hastie, and R. Tibshirani (2008). Sparse inverse covariance estimation with the 
graphical lasso. Biostatistics 9 (3), 432-441. 

Gnanadesikan, R. and J. R. Kettenring (1972). Robust estimates, residuals, and outlier detection 
with multiresponse data. Biometrics 28(1), 81-124. 

Hampel, F., E. Ronchetti, P. Rousseeuw, and W. Stahel (2011). Robust Statistics: The Approach 
Based on Influence Functions. Wiley Series in Probability and Statistics. Wiley. 

Hampel, F. R. (1974). The influence curve and its role in robust estimation. J. Amer. Statist. 
Assoc. 69, 383-393. 

Han, F., J. Lu, and H. Liu (2015). Robust scatter matrix estimation for high dimensional distri¬ 
butions with heavy tails. Technical report, Technical report, Princeton University. 

Higham, N. J. (2002). Computing the nearest correlation matrixa problem from finance. IMA 
journal of Numerical Analysis 22(3), 329-343. 

Hoeffding, W. (1948). A class of statistics with asymptotically normal distribution. Ann. Math. 
Statistics 19, 293-325. 

Hsieh, C.-J., I. S. Dhillon, P. K. Ravikumar, and M. A. Sustik (2011). Sparse inverse covariance 
matrix estimation using quadratic approximation. In J. Shawe-Taylor, R. S. Zemel, P. L. Bartlett, 
F. Pereira, and K. Q. Weinberger (Eds.), Advances in Neural Information Processing Systems 
24 , pp. 2330-2338. Curran Associates, Inc. 

Huber, P. J. (1964). Robust estimation of a location parameter. Ann. Math. Statist. 35, 73-101. 

Huber, P. J. (1981). Robust statistics. John Wiley &: Sons, Inc., New York. Wiley Series in 
Probability and Mathematical Statistics. 

Kendall, M. G. (1948). Rank correlation methods. Griffin. 

Kruskal, W. H. (1958). Ordinal measures of association. Journal of the American Statistical 
Association 55(284), 814-861. 

Lauritzen, S. L. (1996). Graphical Models. Oxford University Press. 

Little, R. J. A. and D. B. Rubin (1986). Statistical Analysis with Missing Data. New York, NY, 
USA: John Wiley & Sons, Inc. 

Maronna, R. A. (1976). Robust M-estimators of multivariate location and scatter. Ann. 
Statist. 4(^)i 51-67. 

Maronna, R. A. and R. H. Zamar (2002). Robust estimates of location and dispersion for high¬ 
dimensional datasets. Technometrics 44(^)-, 307-317. 

Oellerer, V. and C. Croux (2014). Robust high-dimensional precision matrix estimation. Available 
at SSRN 2528996. 

Puri, M. L. and P. K. Sen (1971). Nonparametric methods in multivariate analysis. John Wi¬ 
ley &: Sons, Inc., New York-London-Sydney. 


33 



Ravikumar, P., M. J. Wainwright, G. Raskutti, and B. Yu (2011). High-dimensional covariance 
estimation by minimizing £i-penalized log-determinant divergence. Electron. J. Statist. 5, 935- 
980. 

Ren, Z., T. Sun, C.-H. Zhang, and H. H. Zhou (2015, 06). Asymptotic normality and optimalities 
in estimation of large Gaussian graphical models. Ann. Statist. 4^(3), 991-1026. 

Rousseeuw, P. (1985). Multivariate estimation with high breakdown point. In Mathematical statis¬ 
tics and applications, Vol. B (Bad Tatzmannsdorf, 1983), pp. 283-297. Reidel, Dordrecht. 

Rousseeuw, P. J. (1984). Least median of squares regression. J. Amer. Statist. Assoc. 79(388), 
871-880. 

Rousseeuw, P. J. and G. Groux (1993). Alternatives to the median absolute deviation. J. Amer. 
Statist. Assoc. 88{4:2A), 1273-1283. 

Serfling, R. and S. Mazumder (2009). Exponential probability inequality and convergence results 
for the median absolute deviation and its modifications. Statist. Probab. Lett. 79(16), 1767-1773. 

Shevlyakov, G. and N. Vilchevski (2002). Robustness in Data Analysis: Criteria and Methods. 
Modern Probability and Statistics, 6. VSP. 

Smith, S. M., K. L. Miller, G. Salimi-Khorshidi, M. Webster, G. F. Beckmann, T. E. Nichols, J. D. 
Ramsey, and M. W. Woolrich (2011). Network modelling methods for FMRI. Neuroimage 54 (2), 
875-891. 

Stahel, W. A. (1981). Breakdown of covariance estimators. Fachgruppe fiir Statistik, Eidgenossische 
Techn. Hochsch. 

Swanson, D. (2000). Signal Processing for Intelligent Sensor Systems. Signal Processing and 
Gommunications. GRG Press. 

Tarr, G., S. Muller, and N. G. Weber (2015). Robust estimation of precision matrices under cellwise 
contamination. Computational Statistics & Data Analysis. 

Troyanskaya, O., M. Gantor, G. Sherlock, P. Brown, T. Hastie, R. Tibshirani, D. Botstein, and R. B. 
Altman (2001). Missing value estimation methods for DNA microarrays. Bioinformatics 17(6), 
520-525. 

Tukey, J. W. (1962). The future of data analysis. Ann. Math. Statist. 33, 1-67. 

Van Aelst, S. (2014). Stahel-Donoho estimation for high-dimensional data. International Journal 
of Computer Mathematics (ahead-of-print), 1-12. 

Vu, V. Q., J. Cho, J. Lei, and K. Rohe (2013). Fantope projection and selection: A near-optimal 
convex relaxation of sparse pea. In Advances in Neural Information Processing Systems, pp. 
2670-2678. 

Werhli, A. V., M. Grzegorczyk, and D. Husmeier (2006). Gomparative evaluation of reverse 
engineering gene regulatory networks with relevance networks, graphical gaussian models and 
bayesian networks. Bioinformatics 22(20), 2523-2531. 

Yuan, M. and Y. Lin (2007). Model selection and estimation in the Gaussian graphical model. 
Biometrika 94(1), 19-35. 


34 



Zhang, F. (2011). Matrix Theory: Basic Results and Techniques. Universitext. Springer. 

Zhao, T., H. Liu, K. Roeder, J. Lafferty, and L. Wasserman (2012). The huge package for high¬ 
dimensional undirected graph estimation in r. J. Mach. Learn. Res. 13, 1059-1062. 


35 



A Proofs of supporting lemmas 


In this Appendix, we provide the proofs of the technical lemmas used to establish Theorems 1 and 2 
in Section 3.1. 


A.l Proof of Lemma 1 

When i = j, we have 
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is a [/-statistic, and the last inequality follows from the fact that cos(x) is 1-Lipschitz. By Hoeffd- 
ing’s inequality for [/-statistics, we have 
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Now, consider the case where i ^ j- Note that 
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where pf^ = E{rP) and the expectation is with respect to the distribution under model (1). Since 




is a [/-statistic with kernel bounded between —1 and 1, Hoeffding’s inequality and the fact that 
sin(x) is 1-Lipschitz implies that the first term on the right-hand side of inequality (41) satisfies 
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Combining inequalities (40) and (42) and taking t = we conclude that with probability 
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For the second term on the right-hand side of equation (41), we have under model (1) that for 
any pair i / j, 


i.i.d. 


{Xki,Xkj) ~ Fij — [1 


Hj, 


yi < k < n, 


(44) 


where 4)^^. Sjjj}) is the marginal distribution of {Yki,Ykj), Hij is a mixture of 

the distributions of Y^i, Y^j, Zki, and and 1 — 7ij = (1 — ej)(l — ej). 

By Lemma 12, we have pfj = sin“^ + Rij, where \Rij\ < 127 *^ -|- 17 Setting 
we then have 

= |sin (sin“^(pij) R^j) - Pij\ 

= |sin(sin"^(pjj)) cos{Rij) + cos(sin“^(pjj)) sin(i?'j) - p^^l 


■ k\ 


sm [^Pij j 

Pij 


= Pij cos{Rij) + y 1 - sm{R'ij) - Pij 

< \pij (l - cos(i2'y) I -h yj 1 - pfj sin(i?'y 

< [l-cos(i2'y] -h I sin(i?'y|. 

Note that 7jj = ei + ej — eiEj < 2e, so 

|i?y < ^( 127 i,- + 177|,) < I (12 • 2e + 17(2e)y = 127re + Sdvre^. 


In particular, this bound is less than 1 when e < 0.02. Then using the fact that | sin(a:) — a:| < 
2 

and |1 — cos(x)| < for |x| < 1, we conclude that 


max 

i<7i<p 


vr 


K 


Sin ( -P^J 


'ij 


< max 


, , , iR7)^ \R'iA^ 

1^7-1+ Mf^ + 


6 


< 2 max \RA\ < 267re 




'7' — 


(45) 


Combining inequalities (43) and (45) then proves the desired result. 


A.2 Proof of Lemma 2 

Under model (1), we have the marginal distributions 

Xki ■ Fi = {1- + eiHi, VI < /c < n, 

for each 1 < i < p, where = N{pi,af) is the marginal distribution of Yfcj and Hi is the 

marginal distribution of 

Let d{Fi) and fi(4>^._o-i) denote the population MADs corresponding to Fi and respec¬ 

tively. Since di = [<I)“^(0.75)]“^(lj and ai = [‘h“^(0.75)]“^d(4>^.^o-J, with J* defined as in equa¬ 
tion (3), it suffices to bound the term \di — fi(4>^.^o-i)|) which we decompose as follows: 

\d^ - d(4>^„.J| < Id. - d{Fi)\ + \d{Fi) - d(<h^„.J|. 
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By Lemma 11, for 0 < f < 1, 


P( m<^ \di - d{Fi)\ >t)< - d{Fi)\ > t) 

i=i 

< 6p max |exp(—2nc^(crj)f^)| 

i<i<p ' 'J 


= 6pexp —2n min c^(ai)t^ 

l<i<p 


Let t = $ ^(0.75)C"< 1. With probability at least 1 — 6p H0-75)]2c'2mmi<i<pc2(o-i) i}^ 
we then have 

max \di — d{Fi)\ < <1>~^(0.75)C'''\/— 
i<i<p V n 

On the other hand, by Lemma 10, we have 


max \d{Fi) - < 4.8 max aiei < A.SM^e. 

l<i<p 


Thus, with probability at least 1 — 6p h0-75)]2c'2mmi<i<pc2(o-i) i}^ 


max \di — (i(<h 

1<2<P 


Mi ) 


< <^-^{0.75)0' 



+ A.SMae. 


It follows that with the same probability, 


max \ai — ai 
l<i<p 


[^■n0-75)]“^ 


max \di — d(4> 

1<2<P 




< C 



+ 7.2M^e. 


A.3 Proof of Lemma 3 

When i = j, we have 2sin(^r^) = pu = 1; hence, we only need to consider the case when i ^ j. 
First, note that 


2 sin ( ) - p 




< 2 


vr a 

sm I 


TT 


— sin ( —Fir. 
6 


+ 


2sin( 


'IJ 


(46) 


where the expectation is taken with respect to the distribution under model (1). By Lemma 14, 

r*‘S' — ^ ^ TJ■ . TT. . 

Id ~ n+i^*J n+i'jj’ wneie Uij 

. . i< 

^3 


we have rf- = where Uij is a tZ-statistic with kernel bounded between —3 and 3, 

and is the Kendall’s tau correlation. Using the fact that sin(x) is 1-Lipschitz, we then have 


P[ 2 


TT a 

sm I -r,^. 


vr 


— sin ( —Eir: 
6 


>t) <P[\rf^-E{rf^)\>^-^ 


= P 


^ - E{U,^)) + - pf^) 


n + 1 


n + 1' 


TT 


<p{\U^j-E{Uij))\ + 
<p{\U,j-E{Ui,)\>^., 


6 3t 

- > — 

n + 1 TT 
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where the last inequality follows from the choice t = and the fact that when 


n > —• Furthermore, Hoeffding’s inequality implies 


P( \Uij - EiUij)\ > 1^) < 2exp f-2 


n 1 \ / nt^\ 

LsJUJ 6^ 


'327r2 ) ■ 


Plugging in t = and using a union bound, we then have 


P I max 2 
i<hi<p 




^<2/exp(-^)=2p-{i^-}. 

(47) 

For the second term on the right-hand side of equation (41), we have under model (1) that for any 
pair j, 

{Xki,Xkj) ~ Fij = (1 — \/l < k < n, 

where marginal distribution of {Yki,Ykj), Hij is a mixture of 

the distributions of Y^i, Y^j, Z^i, and Z^j, and 1 — 7ij = (1 — ej)(l — ej). 

By Lemma 13, we have E{rfj) = | sin“^ (^) -|- Rij, where \Rij\ < AS'jij + 1297?- -|-88 ^fj -|- 
Setting R'-j = ^Rij, we then have 


2sin (^-^(rg)j - Pij =|2sin(sin \pij/2) + R'^) - Pij\ 

= |2sin(sin“^(pij/2))cos(i?F) -h2cos(sin“^(Pij/2))sin(i?F) -PijI 
= Pij cos{R'ij) + 2^Jl- p\-l^ ■ sm{R'ij) - p^j 

< \pij (l - cos(i?F)) I -h 2 pj./A ■ sin(i2F) 

< [l - cos(i?F)] 2| sin(i2F)|. 

Note that ^ij = et + ej — ettj < 2e, so 

\R'iA < 5 f487ij 12972 - -h 88-i% + 


^27 - ' n + 1, 

< - 1 48 • 2e 129(2e)2 88(2e)^ -h 

6 V n -|- 1 


< IGvre -|- 867re2 -|- 1187re^ -|- 


27r 


n -|- 1 

In particular, this bound is less than 1 when e < 0.01 and n > 15. Then using the fact that 
I sin(x) — x\ < and | cos(x) — 1| < §]- for |x| < 1, we conclude that 


max 

i<hi<p 


TT 


2sin(-F;(r,^. 


'IJ 


< max 

l<i j<p 


2|-Rij| + 2 


{Rijj 


./ ^2 1^/ |3 


+ 


< 3 max 


< 487re -h 2587re2 3547re^ -h 


Gvr 

n -|- 1 


<51,.+ 

2 V n 
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^ R 2 

where the final inequality uses the assumption n > once more. Combining this bound with 

inequality (47) implies the desired result. 


B Lemmas for MAD concentration 


In this Appendix, we prove several lemmas that are needed in deriving consistency of the MAD 
estimator. We begin with some results concerning the concentration of sample medians from an 
arbitrary distribution. A version of Lemmas 7 and 8 is also contained in Serfling and Mazumder 
(2009). 

Lemma 6. Let Xi,... ,Xn be a random sample from a distribution with cdf F, and let m be the 
sample median. If m < c, then \{Xi : Xi < c}| > If m > c, then \{Xi : Xi < c}| < 

Proof. This result follows easily from the definition of the sample median. □ 

Lemma 7. Let Xi,... ,Xn be a random sample from a distribution F. Let m be the population 
median and let m be the sample median. Then 


P( |m — m| > ^ ) < 2exp(—2n6^(t)), 


where b{t) = min {F(m + ^ — F{m — |)}. 

Proof. By Lemma 6, 


P[ m > m + ^ ■ Xi < m + ^ 


INI) 


2=1 


pk:i <5 


n 


P[ Y^iYi-EY,) <--npi 
2=1 


= exp 


, 1\2 
- 2 n{pi - - 


(48) 


where T) = l{Aj <m + |} and pi = F{m + |), and the last inequality follows from Hoeffding’s 
inequality. Similarly, we have 


< m - 0 - ■ Xi < m - 


2 

2=1 


P[ YiEi-EZi) > 


n 


np2 


i=l 


< exp 


, 1 \ 2 ' 
- 2n{p2 - - 


(49) 
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where Zi = || and p 2 = F{m — Combining expressions (48) and (49), we then 

obtain 



/ 1\2 


/ 1\2 

< exp 

-2n(pi-2) 

+ exp 

-2n(p2-2) 


< 2exp(—2n6^(t)). 


□ 


Lemma 8. Let he a random sample from a distribution with cdf F. Let m and d 

denote the population median and MAD, respectively, and let m and d denote the sample median 
and MAD. Let G he the distribution of \Xi — m\. Then 


P{\d — d\ > t) < 6exp(—2na^(t)), 


where 


(50) 


Proof. Let Wi = \Xi — m\. By the definition of the sample MAD, Lemma 6 gives 
P{d >d + t)<p(^\{Wi:Wi<d + t}\<'^'^ 

= p(^|{Ai : \Xi-m\<d + t}\ < 

< P^{Xi : \Xi — m\ < d + t}\ < and |m — m| < + P^|m — m| > ^ 

< jXi : |Xi - m| < d + + P^|m - m| > 0 

= P^^l||Aj-m| <d+^| < + P^m-m\ > ^ 

= p(^Y2{Yi- EYi) < I -np3^ +p(^\m-m\ > 0, 

where Yi = 1 {\Xi — m\ < d+ andpa = G{d+^). Then by Hoeffding’s inequality and Lemma 7, 
the last quantity is bounded by 


exp 


- 2n(p3 - 


+ 2 exp(—2n6^(t)). 


(51) 
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Similarly, 


n 


P{d < d -t) < P[\{Wi -.Wi < d - t}\> - 


= P 

< P 

< P 


■■ \Xi - rh\ < d - t]\ 

^|{Xj : \Xi — m\ < d — t}\ > —, and |m — m| < + P^|m — m| > 

^Xi : - m| < d - 


— I + P( |m — m| > - 


i=l 

n 


Pi'^l\\Xi-m\<d--\>^\+p{\rh-m\>- 


P[ ^ - np4 I + PMm - m| > - I, 


i=l 


where Zi = l {\Xi — m\ < d — and p 4 = G{d— |). By Hoeffding’s inequality and Lemma 7, the 
last quantity is upper-bounded by 


exp 


- 2n(p4 - 


-b 2 exp(—2n6^(t)). 


Combining expressions (51) and (52) then yields 
P{\d — d\ > t) < 4exp(—2n6^(t))-bexp 




/ 1\2' 

-2n(p3-2) 

+ exp 

-2n(p4-5j 


(52) 


< 6exp(—2na^(t)). 


□ 


Next, we prove two population-level lemmas for the e-contamination model. As remarked in 
the introduction, we use the notation F“^(c) = inf{x : F{x) > c}, which is defined even if the cdf 
F is not surjective on the interval [0,1]. Note that Lemmas 9 and 10 do not impose any conditions 
on the contaminating distribution H. 

Lemma 9. Let F = (1 — e)<h^^o- + where denotes the distribution and H is an 

arbitrary distribution. Let := be the standard normal cdf and suppose that 0 < e < 1. Then 




(53) 


Proof. Let F = {I — e)«b^^a- + ePI. Then 


F 4> 


3-1 

-11,(7 


1 - e 


= (l-e)<h, 


/i.o-1 '^11,a 


$ 


1 - e 


+ €H{^ 


)-i 

• /i ,(7 


1 - e 


>(l-e).^ = c, (54) 

where by a slight abuse of notation, we use F and H to denote the cdfs of the corresponding 
distributions. In addition. 


1 -F 4> 


.-1 


c — e 


M.<^Vl - 


= (l-e) 

> (1 - e) 


1-^,^1-c. 


+ e 


1-H ^ 


.-I 


c — e 


M-<^Vi - 


(55) 
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Combining equations (54) and (55), and using the facts that F is monotonically increasing, we then 
obtain the desired bound (53). Note that the outer equalities hold since 4>“^(x) = □ 

Lemma 10. Let F = [1 — e)<l>^_o- + eH, where denotes the N{n,a'^) distribution and H is 
an arbitrary distribution. Suppose 0 < e < Let d{F) and d(4>^_o-) denote the population MADs 
corresponding to F and respectively. Then 

|d(F)-d($^,,)| <4.8ae. 


Proof. By an abuse of notation, we also use F to denote the cdf of the contaminated distribution. 

Then F~^ is the quantile function. Note in particular that the following statements hold, where 

X ~ T, as an easy consequence of the definition of F~^: 

(i) d{F) < a if P{\X - F-^{0.5)\ < a) > 0.5, 

(ii) d{F) > a if P{\X - F-\0.5)\ < a) < 0.5. 

Furthermore, we may write 

P{\X - F-^(0.5)| < a) > (1 - e) • P{\Z - ^"^(0.5)1 < a) 

= (1 - e) (T-Ho. 5) + a) - (^-^(0.5) - a)} , 

where Z ^ N{yL, cr^). By Lemma 9, the last expression is further lower-bounded by 


(1-e) ch 


. fi,a 


$ 


-1 

/j.,cr 


0.5 - e 
1 - e 


a - 4> 


/i,cr 


4- 


-1 

/i,C7 


0.5 
1 - e 


— a 


We will take 


0 = 4) 


-1 

11,(7 


0.75 
1 - e 


- 4- 


-1 

/I,a 


0.5 - e 
1 - e 



- 4> 


-1 

fi,a 


0.25 - e\ 


where the second inequality comes from the fact that 4>^|,.(6) = —4>^],.(1 — b). 
bound becomes 


(1-e) 


0.75 
1 - e 


0.25 - e\ 
1-e ) 


> 0.5. 


Then the lower 


Putting the bounds together, we have 


P(|X-F-^(0.5)| <o) >0.5, 


so by the implication (i) above, it follows that 


/ 0-75 A /0.5-e 

d{F) < 4>„i I ^- 1 - 4> ^ 


\l-ej V 1-e 

Similarly, we may derive a lower bound on d{F) by writing 

p-i 


P{\X - F-^(0.5)| > a) > (1 - e) • P{\Z - ^-^(0.5)1 > a). 


where Z ~ N{yL,a‘^). Furthermore, 

P{\Z - F-\0.5)\ <a) = (F-^(0.5) + a) - {F-\0.5) - a) 


< 4>„ ^ $ 


^-1 


1 1 _ 


0.5 ^ ^ ^ 1 /0.5-e 

' + a 4--1 


I i_ 


(56) 


a , 
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using Lemma 9. Taking 


a = 


-1 


we then have the bound 


0.75-2e\ 

- ‘h, 


0.5 


^ 1 _ 


= $ 


-1 


IJ; 0 - \ _ 


0 - 5 - e '' ^-1 


0.25 


V 1 - 2e 


P{\Z - F-\0.5)\ <a)< 


0.75 - 2e 0.25 0.5 - 2e 


1 - 2e 1 - 2e 1 - 2e ’ 


implying that 


It follows that 


P{\X - F-\0.5)\ > a) > (1 - e) • ( 1 - 


0.5 - 2e 


1 - 2e 


> 0.5. 


P(|X-F-^(0.5)| <o) <0.5, 


so by implication (ii) above, 


d{F) > $ 


-1 

/j.,cr 


a75^\_ , 

1 - 2e J 
-1 


0.5 
1 - e 


(57) 


Using the fact that = <h^ ),.(0.75) and <I>^ ),.(0.5) = 0, inequality (56) implies that 


diP) - - $-),(0.75)| + |a>-),(0.5) - ch-), 


0.5 - e 


— e 


< 3.6(7 

= 3.6(7 

< 4.8(76, 


0.75 
1 - 6 
1.25e 


-0.75 + 0.5- 


0.5 - 6 
1 - 6 


1 - 6 


where the second inequality comes from Lemma 16 and the observation 4>^ ),.(x) = /i + (7<hg J(x), 
along with the assumption e < Similarly, inequality (57) implies that 


diP) - d(4>^,,) > $ 




0.75 - 2e 
1 - 2e 


-$;;,(0.75)|> + <’cI>-j,(0.5)-4> 




^-1 


0.5 
1 - 6 


, , 0.75 - 2e\ / 0.5 

> -3.6ct <! ( 0.75 - ^ 1 + 1 -0.5 


1 - 2e 


1 - 6 


= —3.6(7 
> -3.98(76 

Thus, we have the desired result. 


0.56 0.56 

+ 


1-26 1-6 


□ 


We conclude with the main lemma of this section, which establishes the consistency of the 
sample MAD to its population-level version. 

Lemma 11. Let Xi,... be a random sample from F = (1 — 6)$^^^ + where 0 < 6 < 
^^,cr denotes the N{p,,a‘^) distribution, and H is an arbitrary distribution. Let d := d{F) be the 
population MAD corresponding to F, and let d be the sample MAD. Then for 0 < t < 1, we have 

P{\d — d\ > t) < 6exp(—2nc^((T)t^), (58) 
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Proof. By Lemma 8 , it suffices to show that 

a(0 ^ c{a)t, 

for the e-contaminated distribution, with a{t) as defined in the lemma. With an abuse of notation, 
let F, and H denote the cdfs of the respective distributions. Let 

G{c) = P{\Xi -m\ < c), 

where m denotes the median of the contaminated distribution. Note that by the definition of the 
median, we have F{m) > ^ and G{d) > Define 

bi = F(m + ^ > F(m + - F{m), 

63 = + -^ — - > G^d + -^ — G{d), and 

where we have used the fact that F {m — |) < 2 G (d — |) < ^ in the second and fourth 
inequalities. Then a{t) = min{ 6 i, 62 ) ^ 3 ; fe 4 }- 
Note that 


> (1 - e) + e + 0 - H{m)^ 

> (1 - e) • 


Similarly, we can check that 

62 > (1 - e) ( $^, 0 - 


m - - ] - I m - - 


t 


b3>{l-e) (^G$ -G$(d)J , 

hA > (1 -e) (Gd- (d - 7 ) -G$ ( d + 


2 
and 


where G^{c) := + c) — — c). By the mean value theorem, we have ci, C2, C3, and C4 

such that 


hi > 

62 > (1 - e)^>'^^„(c2)^, 

63 > (1 - e)G$(c3)^ = (1 - e) + C3) + - C3)) 

64 > (1 - e)G$(c4)^ = (1 - e) + C4) + - C4)) 


m < Cl < m + -, 

t t 

m - <C2 < m -, 

2 “ 4’ 

t 

d < C3 < d + -, 

7 t , t 

d--<C4<d--. 
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Note in particular that 


Cl, C2, m + C3, m — C3, m + C4, m — C4 G 


, t , t 

m - d- m + d+ - 


Let d{^fj,^a) = ^(0.75)cr be the MAD estimator corresponding to By Lemma 9, for 

0 < e < the median m = F~^{0.5) satisfies 

-i/l - 2 e\ 


" +U j" - "+s ” s ^ (2^)" s (15 j" 

In addition, Lemma 10 implies that for 0 < e < 3 ^, we have 

d < d(d>^,o-) + 4.8(76 < 4>“^(0.75)(7 + 0.3(7 < a. 

Therefore, for c G [m — d — m + d + ^] and 0 < t < 1, we have 


c> m — d -+ — ]a — a — 0.5 > fj, — 1.1a — 0.5, and 

2 \15J 

c < m + d — < fJ. + 4*”^ I — )cr + (7 + 0.5 < /r + l.lcr + 0.5. 

2 \15J 

min |$'^g.(c) ■. m — d — ^ <c<m + (i+^| > min{<h'^ ^^.(c) : |c — /i| < 1.1a + 0.5} 

(I.I 0 - +0.5)2' 


7 


Hence, 




TTfJ 


exp 


2 ct2 


It follows that 


a{t) = min{ 6 i, 62 , h, ^ 4 } > (1 - e) 


\/^i 


vra 


exp 


(1.1(7 + 0.5)2^ t 
2^2 


> 


15 


16\/^( 


exp 


TTCr 


(l.la + 0.5)2\ t 


2ct2 


- = c(a)t. 


□ 


C Auxiliary lemmas 

We begin with a lemma describing the behavior of the mean of the Kendall’s tan statistic under a 
contaminated normal distribution. Note that the statement of the lemma does not depend on the 
variances of the uncontaminated marginals, or the contaminating distribution H. 

Lemma 12. Let (A^i, Afc 2 ), for k = 1,... ,n, be a random sample from 

F = {1- 7)4>p + 7 iL, 

where is a bivariate normal distribution with correlation p and H is an arbitrary bivariate 
distribution. Let = Ep{r^), where r^ is Kendall’s tau statistie. Then 

p^ = - sin“^(p) + R, 

TT 

where |i?| < I 27 + 177^. 
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Proof. Define a{X) = 1{X > 0), and let sign(X) = 2a{X) — 1. In particular, 


sign(X) = 1 {X > 0 ) - 1 {X < 0 ) = 2 a{X) - 1 - 1 (X = 0 ) = sign(X) - 1 {X = 0 ). 
We may rewrite as 

= E [sign(Xii - X2i)sign(Wi2 - X22)] 

= E [^(Xn - X2i)^(Xi2 - X22)] - E[l{Xii = X2i)^(Xi2 - X22)] 

- E [^(Xn - X 2 i)l(Xi 2 = X22)] + E [l(Xn = W 2 i)l(Xi 2 = X22)] 

:= A + B + C + D. 


In particular, 

\B\ = \E[l{Xn = X2i)^(Xi2 - X22)]| < E[l{Xn = ^21)] = ^(^11 = ^21), ( 59 ) 

using the fact that |sign(X)| = 1 . Furthermore, we have 

P{Xn = X21) < 7", 

since the normal distribution is absolutely continuous, so we can only have P{Xii = X21) with 
positive probability when both Xi and X2 are drawn from the contaminating distribution. Similarly, 

\c\ = |f;[^(Xii - X2l)l(Xi2 = X22)]| < E[1{Xi 2 = X22)] = P{Xl2 = X22) < 1^. (60) 

We also have 

\D\ = |^[l(Xii = X2l)l(Xi2 = X22)]| < (^[l(Xii = X 2 l)]f'^ {E[ 1 {Xi 2 = X 22 )]f'^ < ■ ( 61 ) 

Turning to the final term, we have 

A = E [sign(Xii - X2i)sign(Wi2 - X22)] 

= E[[ 2 a{Xii - X21) - l){ 2 a{Xi 2 - X22) - 1)] 

= AE[a{Xii - X2i)a{Xi2 - X22)] - 2 E[a(Xn - W21)] - 2 E[a{Xi 2 - W22)] + 1 
= ( 4 F;[o(Xn - X 2 i)a{Xi 2 - ^22)] - l) + 2 (l - E[a{Xii - X21)] - ^[a(Xi 2 - X22)]) 

:= ^ 1 +^ 2 . 


Here, the expectation is with respect to the joint distribution of (Xu, X 12 , W 21 , X 22 ), with density 

/ = [(!- 7)0^ + 7/11] [(1 - 7)(/>2 + 7/^2] 

= (1 - 'yf(l)i(p2 + 7(1 - l)4>ih2 + 7(1 - l)<p2hi + 7^hih2. (62) 


This follows from the fact that the pairs (Xii,Xi 2 ) and (X 2 i,X 22 ) are independently drawn from 
the mixture distribution, where (j) is the joint density of {X^i, Xk 2 ) under <hp, and h is the joint 
density of {Xki,Xk 2 ) under H. Now, let U = Xu — X 21 and V = X 12 — X 22 - Under the product 
distribution 0i02) the distribution of {U,V) is bivariate normal with mean 0 and correlation p. 
Hence, 


and by Lemma 15, 


E^^^,[a{U)a{V)] 


1 

4 


1 + 


2 

TT 


sin ^{p) 


(64) 
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Combining equations (62) and (63), we then have 
Ef[a{U)] 

= (1 - [a{U)] + 7(1 - -f)E^^h2 [a{U)] + 7(1 - HU)] + HEh^h2 HU)] 

= ^ - 7 + ^7^ + 7(1 - l)E^ih 2 HU)] + 7(1 - 7)^02/11 [«(t^)] + HEhrh 2 HU)] 


1 


1 


— 2 + {“1 + ^(t>ih2HU)] + E^2hiHU)]} I 2 ~ ^^iH^iU)] — E^2hiHU)] + Ehih2HU)] f 7 
Noting that [a([/)], Ja([/)] and [a(E)] are between 0 and 1, we have 


EfHU)] - 2 


3 2 

<7+277 


and 


EfHV)] - 2 


3 2 

< 7 + 27 • 


It follows that 


iTlsI = 2|1 - Ef[a{U)] - E;/[a(C)]| < 47 + 67^. 
On the other hand, combining equations (62) and (64), we have 

Til =4E/[a(E)a(I7)] - 1 

= 4{(1 - ^)^E^,^,[aiU)a{V)] + 7(1 - 7)E^,h2HU)a{V)] 


(65) 


= (1-7)' 


+ 7(1 - 'y)E^^hi HU)a(y)] + HEh^h2 HU)a 
2 


- 1 


1 + - sin {p) 


TT 


- 1 


+ 4{7(1 - l)E^,h2 HU)a{V)] +j{l- j)E^,h, HU)a{V)] + HEh,h2 HHa 


= -sm ^(p) + (-27 + 7 ) 


TT 


1 + - sin ^(/}) 


TT 


+ 4(7(1 - 7 )E^,h 2 HU)a{V)] + 7(1 - 7 )E^,h, HU)a{V)] + HEh,h2 HHa 

= -sin"^(p) + I - 2 - -sm~^{p) + 4E;0^/i 2[a(E)a(I7)] + AE^^hiHU)aiV)]\j 
TT TT J 


+ a + - sin-'(p) - iE^,h,[a{U)a(V)] - 4E^,h,HU)a{V)] + iEh,h2HU)a{V)] W 


TT 


Noting that the quantities 


-2 - -sin-'(p) +4^^,fc,[a(E)a(C)] +4E;^,;,Ja(E)a(y)] 


and 


1 + ^ sin-^ (p) - 4E^,h2 HU)a{V)] - AE^,h, HU)a{V)] + 4Eh,h2 HU)a{V)] 

are both bounded in magnitude by 8, we obtain 

2 


Ai - +sin ^(p) 

TT 


<87 + 87 . 


( 66 ) 
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Combining inequalities (59), (60), (61), (65) and (66) then gives 


p^ - - sin ^{p) 

_ 

A\ + A 2 B C ^ D - sin ^ {^p) 

TT 


TT 


< 


-sin ^{p) 

TT 


+ \A2\ + \B\ + \C\ + \D 


< 127 + 177^. 


□ 

The second lemma provides an analogous result to Lemma 13, this time for the Spearman’s rho 
statistic. 

Lemma 13. Let (X^i, ^^ 2 ), for A; = 1,..., n, he a random sample from 

F = {1- 7 )$p + 7 iL, 

where is a bivariate normal distribution with correlation p, and H is an arbitrary bivariate 
distribution. Let r^ be the Spearman’s rho statistic, and suppose the samples {X^i : k = 1,... ,n} 
are unique. Then 

Epir^) = ^ sin"^ (^0 + R, 

where |i?| < 487 + 1297^ + 887 ^ + 

Proof. Let p^ = Ep{r^) be the population version of Kendall’s tau correlation. By Lemma 14, we 
have 

Ep{r^) = . E [sign(Xii - X 2 i)sign(Xi 2 - X 32 )] 4- 

Tlj 1“ J. T~L i~ X 

= 3K [sign(Xii - X 2 i)sign(Xi 2 - X 32 )] 

+ ^^^^( 7 *^ - SK [sign(Xii - X 2 i)sign(Xi 2 - -^ 32 )])- (67) 

Note that the second term is clearly bounded in magnitude by Now define a{X) = \{X > 0), 
and let sign(X) = 2a(X) — 1. Then sign(X) = sign(X) — 1{X = 0). It follows that 

E [sign(Xii - X 2 i)sign(Xi 2 - X 32 )] 

= E [^(Xn - X 2 i)^(Xi 2 - X 32 )] - E[l{Xii = X 2 i)^(Xi 2 - X 32 )] 

- E [^(Xn - X 2 i)l(Xi 2 = X 32 )] + E [l(Xn = X 2 i)l(Xi 2 = X 32 )] 

■.= A + B + C + D. 

A similar argument as in the proof of Lemma 12 yields 

max{|B|, |C|, |L»|} < 7 ^, ( 68 ) 

and 

A = (4L;[o(Xii - X 2 i)a(Xi 2 - X 32 )] - l) + 2(l - E[a{Xii - X 21 )] - E[a{Xi 2 - X 32 )]) 

:= Ai + A 2 . 
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Here, the expectation is with respect to the joint distribution of (Xu, X 12 , X 21 , X 22 , X 31 , X 32 ), with 
density 


/ = [(!- ^)(j)^ + ^hi] [(1 - 7)(?i2 + 7 ^ 2 ] [(1 - 7)</’3 + 7 ^ 3 ] 

= (1 - 7)^(/>1(/>2</>3 + 7(1 - l f[4>l(t)2h2, + 4>l4>3h2 + 4'24>3hl] 

+ 7^(1 - 7)[0i/i2/i3 + 4>2hih3 + (I)3hih2\ + 'y^hih2h3. (69) 


Now let U = Xii—X 21 and V = X 12 —X 32 . Under the product distribution (j)i4>24>3^ the distribution 
of {U,V) is bivariate normal with mean 0 and correlation p/ 2 . Hence, 


[a(C/)] — -U(j(,j(^203 [®'(^)] ~ 2’ 


(70) 


and by Lemma 15, 




1 

4 


1 + 


— sin 
vr 



(71) 


Combining equations (69) and (70), and noting that E[a{U)] is between 0 and 1, we then have 


Ef[a{U)] 

= (1 ~ 7 ) [®(^)] + 7(1 ~ 7 ) {-®</>l<i>2^3 + -®</>l<i>3^2 + -E'</>2<i>3^1 [®(^)] } 

+ - 'y){E^^h2h3HU)] + E^^hihaHU)] + E^^h3h2HU)]} + 'y^Eh,h2h3HU)] 

= - - -7 + 2 "!^ “ 2 ^^ ~ {^’t>\(l>2h3[<^{E)\ + £'</,3 03/i2[a(C/)] + £'</,203hi[a(U)]} 

+ 7 (t “ 7 ) {-^0l/l2^3 [®(^)] E(P2hih3 [®(U)] + i7(^3/ij/i2 [®(^)] } U 7 -®/ll/l2/l3 [®(^)] 

= ^ + C 7 + d'j'^ + 67 ^, 


where |c| < |, |(i| < |, and |e| < It follows that 


Ef[a{U)] - - 


3 9 0 7 0 

^ 2^ + 2^ + 2^ 


and 


Ef[a{V)] - - 


3 9 2 7 o 

^ 2^ + 2^ + 2^ 


so 


IH2I = 2|1 — Ef[a{U)] — Ef[a{V)]\ < 67 + 187^ + 147^. 


(72) 
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Furthermore, combining equations (69) and (71), we have 
= AEf[a{U)a{V)] - 1 
= 4|(l-7)3F;^,^,^3[a(t/)a(y)] 

+ 'y{^-'yf{E^i<t>2h3HU)a{V)] + E^^^^h^[a{U)a{V)] + E^^^^hAa{U)a{V)]} 
+ 'y^il - 'y){E^^h2h3[a{U)a{V)] + E^^hrh3[a{U)a{V)] + E^^hih2[aiU)a(y)]} 

+ ^^E^,h2h3HU)aiV)]]-l 


= (1-7)' 


2 -T / P 

1 H— sin ' 

TT 


- 1 


+ 4|7(1 - 'y?{E4,3ct>2h3[aiU)a{V)] + [a(t/)a(l/)] + E^^^^h3HU)a{V)]} 

+ - -f){E^^h2h3[a{U)a{V)] + E^2hih3[a{U)a{V)] + E^^hih2[a{U)a{V)]} 

+ l^Eh^h^h3[a{U)a{V)] 


^ ^ ( 2 ) 


TT 


2 -A f P 
1 H— sin ' 


TT 


+ 4 | 7(1 - if {E<t>3ci>2h3[a.{U)a{V)] + E^^^^h2[a.{U)a{V)] + E^^^^h3[a.{U)a{V)]] 

+ I'^O- - l){E<i>3h2h3[a{U)a{V)] + E^^h3h3[a{U)a{V)]+ E^^h3h2[a{U)a{V)]] 
+ 'y^Eh^h 2 h 3 [a{U)a{V)] 


^ ■ -1 f P 
= — sm ' 

TT 


( 2 ) + <^^7 + + eV, 


where \c'\ < 10, \d'\ < 22, and |e'| < Hence, we obtain 


Ai -sin 


-1 (P 


TT 


9 46 o 

< IO7 + 2272 + —73. 

O 


(73) 


Combining inequalities (67), (68), (72) and (73), we then obtain 
6 . -I fP 




< 3 

< 3 


Hi + H 2 + H + C + H - - sin ^ 

TT 


+ 


12 


n + 1 


A ■ -1 ( P 

Ai -sm ' 

TT 


< 487 + 1297 ^ + 8873 + 


The following lemma comes from Hoeffding (1948): 


+ IH 2 I + \B\ + \C\ + |Z1| > + 


12 


12 

n + 1 


n + 1 


□ 


Lemma 14. Suppose the samples {X^t : k = 1,... ,n} are unique, for i = 1,2. The Spearman’s 
rho correlation can be decomposed as 


r^ = 


n — 2 
n + 1 


U + 


„K 


n + 1 
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where is the Kendall’s tau correlation, and U is a U-statistic of order 3 with corresponding 
symmetric kernel 

'ifu{Xi,X2,X^) = ^ ^ 3 • sign(Xi^i - Xi^i) sign(Xi^2 - 

(*l,i2,*3)Sperm(l,2,3) 


and the summation is taken over all possible permutations of the three arguments. 

The proof of the following lemma is adapted from an argument in Croux and Dehon (2010). 

Lemma 15. Suppose {X, Y) follows a bivariate normal distribution with mean 0 and correlation 
p. Then 


E[a{X)a{Y)] = P{X > 0,y > 0) 


1 

4 


1 + 


— sin ^{p) 

TT 


Proof. Recall that we may write 

Y = pX + Vl - p^Z, 

where {X, Z) ~ N(0,l2). Furthermore, we have the polar coordinate representation 

{X,Z) = {Rcos 9, Rsine), 

where 0 ~ Uniform(—vr, vr], and R follows a Rayleigh distribution. Then 

Y = R (^pcos{0) + \/l — p'^ sin(0)^ , 


which has the convenient representation Y 


iisin(Q! + 9), where a = sin ^{p). It follows that 


P{X > 0,Y > 0) = P{cos 9 > 0, sin(a + 0) > 0) 


(e G 

r 

\ 5 + a 

1 

2 ; 
1 + - sin ^{p) 

“ TT 

- 


V 

L 2J 

J 2t: 

4 

TT 


□ 


Finally, we have a simple lemma concerning the Lipschitz behavior of the normal quantile 
function: 

Lemma 16. The standard normal quantile function : [0, 1] —>■ M, when restricted to the domain 
[0.2,0.8], is Lipschitz continuous with Lipschitz constant 3.6; i.e., 

|$“^(o)-$“^(6)1 <3.6|o-6|, Vo,6g [0.2,0.8]. 

Proof. It suffices to check that < 3.6, for y G [0.2,0.8]. Since [4>“^]'(4>(x)) • <h'(x) = 

= 1, we have 

[$"^]'($(x)) = Vx G M. 

For y = <h(x) G [0.2,0.8], we have x G [—0.8416,0.8416], and for such x’s, 

[<j)~i]'($(x)) = ^ = \/^exp ^2^^^ — ^2 ' 

This concludes the proof. □ 
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